Abstract

Motivated by the continuing interest in discrete time hidden Markov models (HMMs), 
this paper reexamines these models using a risk-based approach. Simple modifications of 
the classical optimization criteria for hidden path inference lead to a new class of hidden 
path estimators. The estimators are efficiently computed in the usual forward-backward 
manner and a corresponding dynamic programming algorithm is also presented. A particularly
interesting subclass of such alignments is sandwiched between the most common
maximum a posteriori (MAP), or Viterbi, path estimator and the minimum error, or pointwise
maximum a posteriori (PMAP), estimator. Similar to previous work, the new class is
parameterized by a small number of tunable parameters. Unlike their previously proposed 
relatives, the new parameters and class are more explicit and have clear interpretations, and 
bypass the issue of numerical scaling, which can be particularly valuable for applications. 

Keywords: risk, HMM, hybrid, interpolation, MAP sequence, Viterbi algorithm, symbol-by-symbol,
posterior decoding

1. Introduction 

Besides their classical and traditional applications in signal processing and communica- 
tions (Bahl et al., 1974; Brushe et al., 1998; Hayes et al., 1982; Viterbi, 1967) (cf. also fur- 
ther references in (Cappe et al., 2005)) and speech recognition (Huang et al., 1990; Jelinek, 
2001, 1976; McDermott and Hazen, 2004; Ney et al., 1994; Padmanabhan and Picheny, 
2002; Rabiner and Juang, 1993; Rabiner et al., 1986; Shu et al., 2003; Steinbiss et al., 
1995; Strom et al., 1999), hidden Markov models have recently become indispensable in 
computational biology and bioinformatics (Brejova et al., 2008; Burge and Karlin, 1997; 
Durbin et al., 1998; Eddy, 2004; Krogh, 1998; Majoros and Ohler, 2007) as well as in natu- 
ral language modelling (Ji and Bilmes, 2006; Och and Ney, 2000) and information security 
(Mason et al., 2006).
At the same time, their spatial extensions, known as hidden Markov random field models
(HMRFM), have also been immensely influential in spatial statistics (Besag and Green,
1993; Green and Richardson, 2002; Künsch et al., 1995; McGrory et al., 2009), and particularly
in image analysis, restoration, and segmentation (Besag, 1986; Geman and Geman,
1984; Li et al., 2000; Marroquin et al., 2003; Winkler, 2003). Indeed, hidden Markov models
are called 'one of the most successful statistical modelling ideas that have [emerged] in
the last forty years' (Cappe et al., 2005).

HM(RF)Ms owe much of their success on the one hand to the persistence of the Markov 
property of the unobserved, or hidden, layer in the presence of observed data, and on the 
other, to the richness of the observed system (Künsch et al., 1995). Namely, in Bayesian
terms, in addition to the prior, the posterior distribution of the hidden layer also pos- 
sesses a Markov property (albeit generally inhomogeneous even with homogeneous priors), 
whereas the marginal law of the observed layer can still include global, i.e. non-Markovian, 
dependence. 

The Markov property of the posterior distribution and the conditional independence 
of the observed variables given the hidden ones, have naturally led to a number of com- 
putationally feasible methods for inference about the hidden realizations as well as model 
parameters (if any). HMMs are naturally a special case of graphical models (Lauritzen, 
1996), (Bishop, 2006, ch. 8). 

HMMs, or one dimensional HMRFMs, have been particularly popular not least due 
to the fact that linear order of the indexing set (usually associated with time) makes ex- 
ploration of hidden realizations relatively straightforward from the computational view- 
point. In contrast, higher dimensional HMRFMs generally require approximate, possi- 
bly stochastic, techniques in order to compute optimal configurations of the hidden field 
(Cocozza-Thivent and Bekkhoucha, 1993; Joshi et al., 2006; McGrory et al., 2009; Winkler,
2003). In particular, the maximum a posteriori (MAP) estimator of the hidden layer of an
HMM is efficiently and exactly computed by a dynamic programming algorithm bearing the 
name of Viterbi, whereas a general higher dimensional HMRFM would commonly employ 
a simulated annealing type method (Geman and Geman, 1984; Winkler, 2003) to produce 
approximate solutions to the same task. 

1.1 Notation and main ingredients 

We adopt the machine and statistical learning convention and therefore refer to the hidden
and observed processes as $Y$ and $X$, respectively, in effect reversing the convention that
is more commonly used in the HMM context. Thus, let $Y = \{Y_t\}_{t \ge 1}$ be a Markov chain
with state space $S = \{1, \ldots, K\}$, $K > 1$. Even though we include inhomogeneous chains
in most of what follows, for brevity we will still be suppressing the time index wherever
this does not cause ambiguity. Hence, we write $P = (p_{ij})_{i,j \in S}$ for all transition matrices.
Let $X = \{X_t\}_{t \ge 1}$ be a process with the following properties. First, given $\{Y_t\}_{t \ge 1}$, the
random variables $\{X_t\}_{t \ge 1}$ are conditionally independent. Second, for each $t = 1, 2, \ldots$,
the distribution of $X_t$ depends on $\{Y_t\}_{t \ge 1}$ (and $t$) only through $Y_t$. The process $X$ is
sometimes called the hidden Markov process (HMP) and the pair $(Y, X)$ is referred to as a
hidden Markov model (HMM). The name is motivated by the assumption that the process
$Y$ (sometimes called a regime) is generally non-observable. The conditional distribution of



Generalized risk-based path inference in HMMs 



$X_1$ given $Y_1 = s$ is called an emission distribution, written as $P_s$, $s \in S$. We shall assume
that the emission distributions are defined on a measurable space $(\mathcal{X}, \mathcal{B})$, where $\mathcal{X}$ is usually
$\mathbb{R}$ and $\mathcal{B}$ is the corresponding Borel $\sigma$-algebra. Without loss of generality, we assume that
the measures $P_s$ have densities $f_s$ with respect to some reference measure $\lambda$, such as the
counting or Lebesgue measure.

Given a set $A$, integers $m$ and $n$, $m < n$, and a sequence $a_1, a_2, \ldots \in A^{\infty}$, we write $a_m^n$
for the subsequence $(a_m, \ldots, a_n)$. When $m = 1$, it will often be suppressed.

Thus, $x^T := (x_1, \ldots, x_T)$ and $y^T := (y_1, \ldots, y_T)$ stand for the fixed observed and unobserved
realizations, respectively, of the HMM $(X_t, Y_t)_{t \ge 1}$ up to time $T \ge 1$. Any sequence
$s^T \in S^T$ is called a path. We shall denote by $p(x^T, y^T)$ the joint probability density of
$(x^T, y^T)$, i.e.

$$p(x^T, y^T) := P(Y^T = y^T) \prod_{t=1}^{T} f_{y_t}(x_t).$$

Overloading the notation, for every $s^T \in S^T$ and for every sequence of observations $x^T$,
let $p(s^T)$ and $p(x^T)$ stand for the marginal probability mass function $P(Y^T = s^T)$ of path
$s^T$ and probability density function $\sum_{s^T \in S^T} p(x^T, s^T)$ of the data $x^T$, respectively. It is
standard (see, e.g. (Cappe et al., 2005; Ephraim and Merhav, 2002), (Bishop, 2006, ch.
13)) in this context to define the so-called forward and backward variables

$$\alpha_t(s) := p(x^t \mid Y_t = s)\, P(Y_t = s), \qquad \beta_t(s) := \begin{cases} 1, & \text{if } t = T; \\ p(x_{t+1}^T \mid Y_t = s), & \text{if } t < T, \end{cases} \tag{1}$$

where $p(x^t \mid Y_t = s)$ and $p(x_{t+1}^T \mid Y_t = s)$ are the conditional densities of the data segments $x^t$
and $x_{t+1}^T$, respectively, given $Y_t = s$.
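
As an illustration of (1), the following minimal Python sketch computes the forward and backward variables and, from them, the smoothing probabilities $p_t(s \mid x^T) = \alpha_t(s)\beta_t(s)/p(x^T)$ for a finite state space. The function name `forward_backward`, the list-based inputs `pi`, `P`, `B` (with `B[t][s]` holding the emission density $f_s$ evaluated at the $(t+1)$-th observation, using 0-based indexing) and the toy numbers are assumptions made for this sketch only. No numerical scaling is applied, so the sketch is meant for short sequences.

```python
def forward_backward(pi, P, B):
    """Unscaled forward-backward recursions for a finite-state HMM.

    pi[s]   : initial probability of state s
    P[i][j] : transition probability from state i to state j
    B[t][s] : emission density f_s evaluated at the (t+1)-th observation
    """
    T, K = len(B), len(pi)
    alpha = [[0.0] * K for _ in range(T)]
    beta = [[1.0] * K for _ in range(T)]  # beta at the final time equals 1
    for s in range(K):
        alpha[0][s] = pi[s] * B[0][s]
    for t in range(1, T):
        for j in range(K):
            alpha[t][j] = B[t][j] * sum(alpha[t - 1][i] * P[i][j] for i in range(K))
    for t in range(T - 2, -1, -1):
        for i in range(K):
            beta[t][i] = sum(P[i][j] * B[t + 1][j] * beta[t + 1][j] for j in range(K))
    px = sum(alpha[T - 1])  # p(x^T)
    post = [[alpha[t][s] * beta[t][s] / px for s in range(K)] for t in range(T)]
    return alpha, beta, post


# Toy two-state example (all numbers are made up for illustration).
pi = [0.6, 0.4]
P = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.1], [0.4, 0.3], [0.1, 0.6]]
alpha, beta, post = forward_backward(pi, P, B)
print([[round(p, 3) for p in row] for row in post])
```

A quick sanity check is that $\sum_s \alpha_t(s)\beta_t(s)$ gives the same value $p(x^T)$ for every $t$.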

1.2 Segmentation 

Segmentation here refers to estimation of the hidden path $y^T$. Treating $y^T$ as missing
data (Rabiner, 1989), or parameters, a classical and by far the most popular solution to
the segmentation problem is to maximize $p(x^T, s^T)$ in $s^T \in S^T$. Often, especially in the
digital communication literature (Brushe et al., 1998; Lin and Costello Jr., 1983), $p(x^T, s^T)$
is called the likelihood function, which might become problematic in the presence
of any genuine model parameters. Such "maximum likelihood" paths are also called Viterbi
paths or alignments after the Viterbi algorithm (Rabiner, 1989; Viterbi, 1967) commonly
used for their computation. If $(p(s^T))_{s^T \in S^T}$ is thought of as the prior distribution of $Y^T$,
then a Viterbi path also maximizes $p(s^T \mid x^T) := P(Y^T = s^T \mid X^T = x^T)$, the probability mass
function of the posterior distribution of $Y^T$, hence the term 'maximum a posteriori (MAP)
path'.

In spite of its computational attractiveness, Viterbi inference may be unsatisfactory for
a number of reasons, including its suboptimality with regard to the number of correctly
estimated states $y_t$. Also, using the language of information theory, there is no reason to
expect a Viterbi path to be typical (Lember and Koloydenko, 2010). Indeed, "there might
be many similar paths through the model with probabilities that add up to a higher probability
than the single most probable path" (Käll et al., 2005). The fact that a MAP estimate
need not be representative of the posterior distribution has also been recently discussed in
a more general context in (Carvalho and Lawrence, 2008). Atypicality of Viterbi paths
is of particular concern when estimation of $y^T$ is combined with inference about
model parameters, e.g. transition probabilities $p_{ij}$ (Lember and Koloydenko, 2010). Even
when estimating, say, the probability of heads from independent tosses of a biased coin,
we naturally hope to observe a typical realization and not the constant one of maximum
probability.

An alternative and very natural way to estimate $y^T$ is by maximizing the posterior probability
$p_t(s \mid x^T) := P(Y_t = s \mid X^T = x^T)$ of each individual hidden state $Y_t$, $1 \le t \le T$. We
refer to the corresponding estimator as pointwise maximum a posteriori (PMAP). PMAP
is well-known to maximize the expected number of correctly estimated states (Section 2),
hence the characterization 'optimal accuracy' (Holmes and Durbin, 1998). In statistics,
especially spatial statistics and image analysis, this type of estimation is known as Marginal
Posterior Mode (Winkler, 2003) or Maximum Posterior Marginals (Rue, 1995) (MPM)
estimation. In computational biology, this is also known as posterior decoding (PD)
(Brejova et al., 2008) and has been reported to be particularly successful in pairwise
sequence alignment (Holmes and Durbin, 1998). In the wider context of biological applications
of discrete high-dimensional probability models, this has also been called consensus
estimation, and in the absence of constraints, centroid estimation (Carvalho and Lawrence,
2008). In communications applications of HMMs, largely influenced by (Bahl et al., 1974),
the terms 'optimal symbol-by-symbol decoding' (Hayes et al., 1982), 'symbol-by-symbol MAP
estimation' (Robertson et al., 1995), and 'MAP state estimation' (Brushe et al., 1998) have
been used for this.

Although optimal in the sense of maximizing the expected number of correctly estimated
states, a PMAP path might at the same time have low, in principle zero, probability
(Rabiner, 1989). It is actually not difficult to constrain the PMAP decoder to
admissible paths, i.e. paths of positive posterior probability, as described in (Käll et al., 2005)
(albeit in a slightly more general form allowing for state aggregation) and also in Subsection
2.2, (7), below. A variation on this idea has been applied in (Fariselli et al.,
2005) for prediction of membrane proteins, giving rise to the term 'posterior Viterbi decoding
(PVD)' (Fariselli et al., 2005). PVD, however, maximizes the product $\prod_{t=1}^{T} p_t(s_t \mid x^T)$
(Fariselli et al., 2005) (and also (10) below) and not the sum $\sum_{t=1}^{T} p_t(s_t \mid x^T)$, whereas the
two criteria are no longer equivalent in the presence of path constraints (Subsection 2.2).
In (Holmes and Durbin, 1998), a PMAP decoder is proposed to obtain optimal pairwise
sequence alignments. The authors of (Holmes and Durbin, 1998) use the term "a legitimate
alignment" which suggests admissibility, but the description of the actual algorithm
(Holmes and Durbin, 1998, Section 3.8) appears to be insufficiently detailed to verify whether the
algorithm indeed enforces admissibility, or whether inadmissible solutions are an issue at all
in that context.
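
To make the possibility of a zero-probability PMAP path concrete, here is a small brute-force Python sketch. It enumerates all paths of a toy three-state chain, so it only makes sense for very small $K$ and $T$. The model is an assumption chosen for illustration: the emission likelihoods are constant (so the posterior equals the prior) and the transition from state 0 to itself is forbidden. The helper names `joint` and `map_and_pmap` are hypothetical.

```python
from itertools import product


def joint(path, pi, P, B):
    """p(x^T, s^T) for a path s^T; B[t][s] is f_s at the (t+1)-th observation."""
    p = pi[path[0]] * B[0][path[0]]
    for t in range(1, len(path)):
        p *= P[path[t - 1]][path[t]] * B[t][path[t]]
    return p


def map_and_pmap(pi, P, B):
    """Return the Viterbi (MAP) path, the PMAP path, and the PMAP path's posterior probability."""
    T, K = len(B), len(pi)
    paths = list(product(range(K), repeat=T))
    w = {s: joint(s, pi, P, B) for s in paths}
    px = sum(w.values())
    viterbi = max(paths, key=lambda s: w[s])
    # Posterior marginals p_t(j | x^T), obtained by summing over all paths.
    marg = [[sum(w[s] for s in paths if s[t] == j) / px for j in range(K)]
            for t in range(T)]
    pmap = tuple(max(range(K), key=lambda j: marg[t][j]) for t in range(T))
    return viterbi, pmap, w[pmap] / px


# Illustrative model: state 0 can never follow itself.
pi = [0.4, 0.35, 0.25]
P = [[0.0, 0.5, 0.5], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
B = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]  # uninformative observations
print(map_and_pmap(pi, P, B))
```

In this toy model state 0 is the most probable state at both time points, yet the path (0, 0) is forbidden, so the pointwise decoder returns a path of posterior probability zero while the MAP path is admissible.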

In many applications, e.g. gene identification, the pointwise (e.g. nucleotide level) error
rate is not necessarily the main measure of accuracy, hence the constrained PMAP
need not be an ultimate answer. Together with the above problem of atypicality of MAP
paths, this has been addressed by moving from single path inference towards envelopes
(Holmes and Durbin, 1998). Thus, for example, in computational biology the most common
approach would be to aggregate individual states into a smaller number of semantic labels
(e.g. codon, intron, intergenic). In effect, this would realise the notion of path similarity by
mapping many "similar" state paths to a single label path, or annotation (Brejova et al.,
2008; Fariselli et al., 2005; Käll et al., 2005; Krogh, 1997). However, this leads to the problem
of multiple paths, which in many practically important HMMs is NP-hard and hence out of
reach of the dynamic programming approach of the Viterbi algorithm (Brejova et al., 2007). Unlike the
Viterbi/MAP decoder, the PMAP decoder handles annotations as easily as it does state
paths, including the enforcement of admissibility (Käll et al., 2005). A number of alternative
heuristic approaches are also known in computational biology, but none appears to
be fully satisfactory (Brejova et al., 2008). Evidently, mapping optimal state paths to the
corresponding annotations need not lead to optimal annotation and can actually give poor
results (Brejova et al., 2007). Overall, although the original Viterbi decoder has still been
the most popular paradigm in many applications, and in computational biology in particular,
alternative approaches have demonstrated significantly higher performance, e.g.,
in predicting various biological features. For example, (Krogh, 1997) suggested the 1-best
algorithm for optimal labelling. More recently, (Fariselli et al., 2005) have demonstrated
PVD to be superior to the 1-best algorithm, and not surprisingly, to the Viterbi and PMAP
decoders, on tasks of predicting membrane proteins.

A starting point of this paper is that restricting the PMAP decoder to paths of positive
probability is but one of numerous ways to combine the useful features of the MAP
and PMAP path estimators. Indeed, as a sensible remedy against vanishing probabilities,
in his popular tutorial (Rabiner, 1989) Rabiner briefly mentions maximization of the expected
number of correctly decoded (overlapping) blocks of length two or three, rather than
single states. With $k \ge 1$ and $y^T(k)$ being the block length and corresponding path estimate,
respectively, this approach yields Viterbi inference as $k$ increases to $T$ (with $y^T(1)$
corresponding to PMAP). Therefore, this approach could be interpreted as interpolating
between the PMAP and Viterbi inferences. Intuitively, one might also expect $p(x^T, y^T(k))$
to be strictly increasing with $k$. This is not exactly so, as can be seen from Example A
where $p(x^T, y^T(2)) = p(x^T, y^T(1)) = 0$. However, we find the idea of interpolation between
the PMAP and Viterbi inferences worth further investigation. To the best of our knowledge,
the only published work which explicitly proposes a solution to such interpolation is
(Brushe et al., 1998). The approach of (Brushe et al., 1998) is algorithmic, directly based
on continuous mappings, and also deserves an analysis which we present in Subsection 4.2.

1.3 Organization of the rest of the paper 

In this paper, we consider the segmentation problem in the more general framework of
statistical learning. Namely, we consider sequence classifier mappings

$$g : \mathcal{X}^T \to S^T$$

and optimality criteria for their selection. In Section 2, criteria for optimality of $g$ are
naturally formulated in terms of risk minimization whereby $R(s^T \mid x^T)$, the risk of $s^T$, derives
from a suitable loss function. In Section 3, we consider families of risk functions which
naturally generalize those corresponding to the Viterbi and PMAP solutions (Subsection
2.1). Furthermore, as shown in Section 4, these risk functions define a family of path
decoders parameterized by an integer $k$ with $k = 1$ and $k \to \infty$ corresponding to the PMAP
and Viterbi cases, respectively (Theorem 4). We also show the close connection between
the aforementioned family of decoders and the Rabiner $k$-block approach. If needed,
the new family of decoders can easily be embedded into a yet wider class with a principled
criterion of optimality.

All these decoders (classifiers) would only be of theoretical interest if they could not 
be easily calculated. In Section 3, we show that all of the newly defined decoders can be 
implemented efficiently as a dynamic programming algorithm in the usual forward-backward 
manner with essentially the same (computational as well as memory) complexity as the 
PMAP or Viterbi decoders (Theorem 3). 

2. Risk-based segmentation 

Given a sequence of observations $x^T$, we define the (posterior) risk to be a function

$$R(\cdot \mid x^T) : S^T \to [0, \infty].$$

Naturally we seek a state sequence with minimum risk:

$$g^*(x^T) := \arg\min_{s^T \in S^T} R(s^T \mid x^T).$$

Following the statistical decision and pattern recognition theories, the classifier $g^*$ will be
referred to as the Bayes classifier (relative to risk $R$). Within the same framework, the risk
is often specified via a loss function

$$L : S^T \times S^T \to [0, \infty],$$

interpreting $L(a^T, b^T)$ as the loss incurred by the decision to predict $b^T$ when the actual
state sequence was $a^T$. Therefore, for any state sequence $s^T \in S^T$, the risk is given by

$$R(s^T \mid x^T) := E[L(Y^T, s^T) \mid X^T = x^T] = \sum_{a^T \in S^T} L(a^T, s^T)\, p(a^T \mid x^T).$$

2.1 Standard path inferences 

The most popular loss function is the so-called symmetrical or zero-one loss $L_\infty$ defined as
follows:

$$L_\infty(a^T, b^T) = \begin{cases} 1, & \text{if } a^T \ne b^T; \\ 0, & \text{if } a^T = b^T. \end{cases}$$

We shall denote the corresponding risk by $R_\infty$. With this loss, clearly

$$R_\infty(s^T \mid x^T) = P(Y^T \ne s^T \mid X^T = x^T) = 1 - p(s^T \mid x^T), \tag{2}$$

thus $R_\infty(\cdot \mid x^T)$ is minimized by a Viterbi path, i.e. a sequence of maximum posterior
probability. Let $v(\cdot\,; \infty)$ stand for the corresponding classifier, i.e.

$$v(x^T; \infty) := \arg\max_{s^T \in S^T} p(s^T \mid x^T),$$

with a suitable tie-breaking rule.

Evidently, Viterbi paths also minimize the following risk

$$\bar{R}_\infty(s^T \mid x^T) := -\frac{1}{T} \log p(s^T \mid x^T). \tag{3}$$

It can actually be advantageous to use the log-likelihood based risk (3) since, as we shall
see later, it leads to various natural generalizations (Sections 3 and 4).

When sequences are compared pointwise, it is common to use additive loss functions of
the form

$$L_1(a^T, b^T) = \frac{1}{T} \sum_{t=1}^{T} l(a_t, b_t), \tag{4}$$

where $l(a_t, b_t) \ge 0$ is the loss associated with classifying the $t$-th element $a_t$ as $b_t$. Typically,
for every state $s$, $l(s, s) = 0$. It is not hard to see that, with $L_1$ as in (4), the corresponding
risk can be represented as follows

$$R_1(s^T \mid x^T) = \frac{1}{T} \sum_{t=1}^{T} R_t(s_t \mid x^T),$$

where $R_t(s \mid x^T) = \sum_{a \in S} l(a, s)\, p_t(a \mid x^T)$. Most commonly, $l$ is again symmetrical, or zero-one,
i.e. $l(s, s') = I_{\{s \ne s'\}}$, where $I_A$ stands for the indicator function of set $A$. In this case,
$L_1$ is naturally related to the Hamming distance (Carvalho and Lawrence, 2008). Then also
$R_t(s_t \mid x^T) = 1 - p_t(s_t \mid x^T)$ so that the corresponding risk is

$$R_1(s^T \mid x^T) := 1 - \frac{1}{T} \sum_{t=1}^{T} p_t(s_t \mid x^T). \tag{5}$$

Let $v(\cdot\,; 1)$ stand for the Bayes classifier relative to the $R_1$-risk. It is easy to see from
the above definition of $R_1$ that $v(\cdot\,; 1)$ delivers PMAP paths, which clearly minimize the
expected number of misclassification errors. In addition to maximizing $\sum_{t=1}^{T} p_t(s_t \mid x^T)$, $v(\cdot\,; 1)$
also maximizes the pseudolikelihood $\prod_{t=1}^{T} p_t(s_t \mid x^T)$, and therefore minimizes the following
log-pseudolikelihood risk

$$\bar{R}_1(s^T \mid x^T) := -\frac{1}{T} \sum_{t=1}^{T} \log p_t(s_t \mid x^T). \tag{6}$$
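
The four risks in (2), (3), (5) and (6) are straightforward to evaluate once the posterior path probability and the posterior marginals are available. The following Python sketch is only an illustration: the function name `risks` and its argument layout are assumptions, and the numbers in the usage lines are made up (they describe a two-step, three-state posterior in which the path (0, 0) has probability zero).

```python
import math


def risks(path, post_path_prob, marg):
    """Evaluate the risks (2), (3), (5), (6) of a path.

    path           : tuple of states (0-based)
    post_path_prob : posterior probability p(s^T | x^T) of the path
    marg[t][s]     : posterior marginal p_t(s | x^T)
    """
    T = len(path)
    probs = [marg[t][s] for t, s in enumerate(path)]
    r_inf = 1.0 - post_path_prob                                   # (2)
    rbar_inf = (-math.log(post_path_prob) / T
                if post_path_prob > 0 else math.inf)               # (3)
    r_1 = 1.0 - sum(probs) / T                                     # (5)
    rbar_1 = (math.inf if min(probs) == 0
              else -sum(math.log(p) for p in probs) / T)           # (6)
    return r_inf, rbar_inf, r_1, rbar_1


# Made-up posterior: (0, 0) is pointwise optimal but has probability zero.
marg = [[0.4, 0.35, 0.25], [0.6, 0.2, 0.2]]
print(risks((0, 0), 0.0, marg))
print(risks((1, 0), 0.35, marg))
```

With these numbers the path (0, 0) has the smaller $R_1$-risk but an infinite log-likelihood risk (3), which is the tension the subsequent generalizations address.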

2.2 Generalizations 

Recall (Subsection 1.2) that PMAP paths can be of zero probability (i.e. not admissible).
To ensure admissibility, the $R_1$-risk can simply be minimized over the admissible paths:

$$\min_{s^T : p(s^T \mid x^T) > 0} R_1(s^T \mid x^T) \quad \Leftrightarrow \quad \max_{s^T : p(s^T \mid x^T) > 0} \sum_{t=1}^{T} p_t(s_t \mid x^T). \tag{7}$$

Assuming that $p_t(j \mid x^T)$, $1 \le t \le T$, $j \in S$, have been precomputed (by the classical
forward-backward recursion (Rabiner, 1989)), the solution of (7) can be easily found by a
Viterbi-like recursion (8)

$$\delta_1(j) := p_1(j \mid x^T)\, r_j, \quad \forall j \in S, \tag{8}$$
$$\delta_{t+1}(j) := \max_i \delta_t(i)\, r_{ij} + p_{t+1}(j \mid x^T), \quad \text{for } t = 1, 2, \ldots, T - 1, \text{ and } \forall j \in S,$$

where $r_j := I_{\{\pi_j > 0\}}$, $r_{ij} := I_{\{p_{ij} > 0\}}$. The recursion (8) is also equivalent to

$$\delta_1(j) := p_1(j \mid x^T) + \log r_j, \quad \forall j \in S, \tag{9}$$
$$\delta_{t+1}(j) := \max_i \bigl(\delta_t(i) + \log r_{ij}\bigr) + p_{t+1}(j \mid x^T), \quad \text{for } t = 1, 2, \ldots, T - 1, \text{ and } \forall j \in S,$$

with the convention $\log 0 = -\infty$.

However, in the presence of path constraints, minimization of the $R_1$-risk is no longer
equivalent to minimization of the $\bar{R}_1$-risk. In particular, the problem (7) is not equivalent
to the following problem (posterior-Viterbi decoding)

$$\min_{s^T : p(s^T \mid x^T) > 0} \bar{R}_1(s^T \mid x^T) \quad \Leftrightarrow \quad \max_{s^T : p(s^T \mid x^T) > 0} \sum_{t=1}^{T} \log p_t(s_t \mid x^T). \tag{10}$$

A solution to (10) can be computed by a related recursion given in (11) below

$$\delta_1(j) := p_1(j \mid x^T)\, r_j, \quad \forall j \in S, \tag{11}$$
$$\delta_{t+1}(j) := \max_i \delta_t(i)\, r_{ij} \times p_{t+1}(j \mid x^T), \quad \text{for } t = 1, 2, \ldots, T - 1, \text{ and } \forall j \in S.$$

Recursion (11) is clearly equivalent to

$$\delta_1(j) := \log p_1(j \mid x^T) + \log r_j, \quad \forall j \in S, \tag{12}$$
$$\delta_{t+1}(j) := \max_i \bigl(\delta_t(i) + \log r_{ij}\bigr) + \log p_{t+1}(j \mid x^T), \quad \text{for } t = 1, 2, \ldots, T - 1, \ \forall j \in S.$$
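
The log-domain recursions (9) and (12) differ only in whether the posterior marginal or its logarithm is accumulated, so both can be sketched with one Python routine. This is a minimal illustration, not a reference implementation: the function name `constrained_decode`, the flag `use_log`, and the toy inputs are assumptions, the smoothing probabilities `marg` are taken as given, and the sketch assumes that the emission densities are positive at the observed data, so that only the indicators of positive initial and transition probabilities need to be enforced.

```python
import math

NEG_INF = -math.inf


def constrained_decode(marg, pi, P, use_log):
    """Viterbi-like recursion over paths with positive prior probability.

    use_log=False accumulates p_t(j | x^T) as in (9) (constrained PMAP);
    use_log=True accumulates log p_t(j | x^T) as in (12) (posterior-Viterbi).
    """
    T, K = len(marg), len(pi)

    def score(t, j):
        p = marg[t][j]
        if not use_log:
            return p
        return math.log(p) if p > 0 else NEG_INF

    def log_ind(p):
        # log of the indicator I{p > 0}: 0 if allowed, -inf if forbidden
        return 0.0 if p > 0 else NEG_INF

    delta = [score(0, j) + log_ind(pi[j]) for j in range(K)]
    back = []
    for t in range(1, T):
        new, ptr = [], []
        for j in range(K):
            i_best = max(range(K), key=lambda i: delta[i] + log_ind(P[i][j]))
            new.append(delta[i_best] + log_ind(P[i_best][j]) + score(t, j))
            ptr.append(i_best)
        delta = new
        back.append(ptr)
    # Backtrack from the best terminal state.
    j = max(range(K), key=lambda s: delta[s])
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    return tuple(reversed(path))


# Made-up inputs: the transition 0 -> 0 is forbidden.
pi = [0.4, 0.35, 0.25]
P = [[0.0, 0.5, 0.5], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
marg = [[0.4, 0.35, 0.25], [0.6, 0.2, 0.2]]
print(constrained_decode(marg, pi, P, use_log=False))
print(constrained_decode(marg, pi, P, use_log=True))
```

On this particular toy input both criteria select the same admissible path; in general, as noted above, the sum and product criteria need not agree under path constraints.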

Although admissible minimizers of the $R_1$ and $\bar{R}_1$ risks are by definition of positive probability,
this probability might still be very small. Indeed, in the above recursions, the weight $r_{ij}$
is 1 even when $p_{ij}$ is very small. We next replace $r_{ij}$ ($r_j$) by the true transition (initial)
probability $p_{ij}$ ($\pi_j$) in minimizing the $\bar{R}_1$-risk (i.e. maximization of $\prod_{t=1}^{T} p_t(s_t \mid x^T)$). Then
the solutions remain admissible and now also tend to maximize the prior path probability.
With the above replacements, recursions (11) and (12) now solve the following seemingly
unconstrained optimization problem (see Theorem 3)

$$\max_{s^T} \Bigl[ \sum_{t=1}^{T} \log p_t(s_t \mid x^T) + \log p(s^T) \Bigr] \quad \Leftrightarrow \quad \min_{s^T} \bigl[ \bar{R}_1(s^T \mid x^T) + h(s^T) \bigr], \tag{13}$$

where the penalty term

$$h(s^T) = -\frac{1}{T} \log p(s^T) =: \bar{R}_\infty(s^T) \tag{14}$$

is the prior log-likelihood risk which does not depend on the data. The thereby modified
recursions immediately generalize as follows:

$$\delta_1(j) := \log p_1(j \mid x^T) + C \log \pi_j, \quad \forall j \in S, \tag{15}$$
$$\delta_{t+1}(j) := \max_i \bigl(\delta_t(i) + C \log p_{ij}\bigr) + \log p_{t+1}(j \mid x^T), \quad \text{for } t = 1, 2, \ldots, T - 1, \ \forall j \in S,$$

solving

$$\min_{s^T} \bigl[ \bar{R}_1(s^T \mid x^T) + C h(s^T) \bigr], \tag{16}$$

where $C > 0$ is a regularization constant and $h(s^T) = \bar{R}_\infty(s^T)$ (see Section 3 and Theorem
3). Then, PVD, i.e. the problem solved by the original recursions (11) and (12), can be
recovered by taking $C$ sufficiently small. (Alternatively, the PVD problem can also be
formally written in the form (16) with $C = \infty$ and $h(s^T)$ given, for example, by $I_{\{p(s^T) = 0\}}$.)

What if the actual probabilities $p_{ij}$ ($\pi_j$) were also used in the optimal accuracy/PMAP
decoding, i.e. optimization (8)-(9)? It appears more sensible to replace the indicators $r_{ij}$
($r_j$) with $p_{ij}$ ($\pi_j$) in (9) (and not in (8)). This solves the following problem:

$$\max_{s^T} \Bigl[ \sum_{t=1}^{T} p_t(s_t \mid x^T) + \log p(s^T) \Bigr] \quad \Leftrightarrow \quad \min_{s^T} \bigl[ R_1(s^T \mid x^T) + \bar{R}_\infty(s^T) \bigr]. \tag{17}$$

A more general problem can be written in the form

$$\min_{s^T} \bigl[ R_1(s^T \mid x^T) + C h(s^T) \bigr], \tag{18}$$

where $h$ is some penalty function (independent of the data $x^T$). Thus, the problem (7)
of optimal accuracy/PMAP decoding over the admissible paths is obtained by taking $C$
sufficiently small and $h(s^T) = \bar{R}_\infty(s^T)$. (Setting $C \times h(s^T) = \infty \times I_{\{p(s^T) = 0\}}$ also reduces
the problem (18) back to (7).)

3. Combined risks 

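Motivated by the previous section, we consider the following general problem

$$\min_{s^T} \bigl[ C_1 \bar{R}_1(s^T \mid x^T) + C_2 \bar{R}_\infty(s^T \mid x^T) + C_3 \bar{R}_1(s^T) + C_4 \bar{R}_\infty(s^T) \bigr], \tag{19}$$

where $C_i \ge 0$, $i = 1, 2, 3, 4$, $\sum_{i=1}^{4} C_i > 0$. This is also equivalent to

$$\min_{s^T} \bigl[ C_1 \bar{R}_1(s^T \mid x^T) + C_2 \bar{R}_\infty(s^T, x^T) + C_3 \bar{R}_1(s^T) + C_4 \bar{R}_\infty(s^T) \bigr], \tag{20}$$

where

$$\bar{R}_1(s^T \mid x^T) = -\frac{1}{T} \sum_{t=1}^{T} \log p_t(s_t \mid x^T), \quad \text{recalling (6)},$$

$$\bar{R}_\infty(s^T, x^T) := -\frac{1}{T} \log p(x^T, s^T) = -\frac{1}{T} \Bigl[ \log p(s^T) + \sum_{t=1}^{T} \log f_{s_t}(x_t) \Bigr]$$
$$= -\frac{1}{T} \Bigl[ \log \pi_{s_1} + \sum_{t=1}^{T-1} \log p_{s_t s_{t+1}} + \sum_{t=1}^{T} \log f_{s_t}(x_t) \Bigr],$$

$$\bar{R}_\infty(s^T \mid x^T) = -\frac{1}{T} \log p(s^T \mid x^T), \quad \text{recalling (3)},$$
$$= \bar{R}_\infty(s^T, x^T) + \frac{1}{T} \log p(x^T),$$

$$\bar{R}_1(s^T) := -\frac{1}{T} \sum_{t=1}^{T} \log p_t(s_t), \tag{21}$$

$$\bar{R}_\infty(s^T) = -\frac{1}{T} \log p(s^T), \quad \text{recalling (14)},$$
$$= -\frac{1}{T} \Bigl[ \log \pi_{s_1} + \sum_{t=1}^{T-1} \log p_{s_t s_{t+1}} \Bigr]. \tag{22}$$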

The newly introduced risk $\bar{R}_1(s^T)$ is the prior log-pseudo-likelihood. Evidently, the combination
$C_1 = C_3 = C_4 = 0$ corresponds to the MAP/Viterbi decoding; the combination
$C_2 = C_3 = C_4 = 0$ yields the PMAP case, whereas the combinations $C_1 = C_2 = C_3 = 0$ and
$C_1 = C_2 = C_4 = 0$ give the maximum a priori decoding and marginal prior mode decoding,
respectively. The case $C_2 = C_3 = 0$ subsumes (16) and the case $C_1 = C_3 = 0$ is the problem

$$\min_{s^T} \bigl[ \bar{R}_\infty(s^T \mid x^T) + C \bar{R}_\infty(s^T) \bigr]. \tag{23}$$

Thus, a solution to (23) is a generalization of the Viterbi decoding that allows one to
suppress ($C > 0$) the contribution of the data. It is important to note that with $C_2 > 0$ every
solution of (19) is admissible. No less important, and perhaps a bit less obvious, is that
$C_1, C_4 > 0$ also guarantees admissibility of the solutions, as stated in Proposition 1 below.

Proposition 1 Let $C_1, C_4 > 0$. Then, for almost every realization $(x^T, y^T)$ of the HMM
process $(X^T, Y^T)$, the minimized risk (19) is finite and any minimizer $s^T$ is admissible, i.e.
satisfies $p(s^T \mid x^T) > 0$.

Proof Without loss of generality, assume $C_2 = C_3 = 0$. Suppose the problem has no
finite solution. Then for any $s^T$ with $p(s^T) > 0$, we would have some $t$, $1 \le t \le T$, such that
$p_t(s_t \mid x^T) = 0$. This would imply that $p(x^T, s^T) = 0$ for all $s^T$ with $p(s^T) > 0$, contradicting
the hypothesis that $p(x^T, y^T) > 0$. Now, suppose that $s^T$ is a minimizer of (19) but
$p(s^T \mid x^T) = 0$. Since $p(s^T) > 0$, we would have some $t$, $1 \le t \le T$, such that $f_{s_t}(x_t) = 0$. This
would imply $\alpha_t(s_t) = 0$ (see (1)), and subsequently that $p_t(s_t \mid x^T) = 0$ and $\bar{R}_1(s^T \mid x^T) = \infty$,
contradicting optimality of $s^T$. ■



Remark 2 Thus, note that the Posterior-Viterbi decoding (Fariselli et al., 2005) can be
obtained by either setting $C_3 = C_4 = 0$ and taking $0 < C_2 \ll C_1$, or setting $C_2 = C_3 = 0$
and taking $0 < C_4 \ll C_1$.

If the smoothing probabilities $p_t(j \mid x^T)$, $t = 1, \ldots, T$ and $j \in S$, have been already computed
(say, by the usual forward-backward algorithm), a solution to (19) can be found by a
standard dynamic programming algorithm. Let us first introduce more notation. For every
$t \in \{1, \ldots, T\}$ and $j \in S$, let

$$g_t(j) := C_1 \log p_t(j \mid x^T) + C_2 \log f_j(x_t) + C_3 \log p_t(j).$$

Note that the function $g_t$ depends on the entire data $x^T$. Next, let us also define the
following scores

$$\delta_1(j) := C_1 \log p_1(j \mid x^T) + (C_2 + C_3 + C_4) \log \pi_j + C_2 \log f_j(x_1), \quad \forall j \in S,$$
$$\delta_{t+1}(j) := \max_i \bigl(\delta_t(i) + (C_2 + C_4) \log p_{ij}\bigr) + g_{t+1}(j) \tag{24}$$

for $t = 1, 2, \ldots, T - 1$, and $\forall j \in S$.

Using the above scores $\delta_t(j)$ and a suitable tie-breaking rule, below we define the backpointers
$i_t(j)$, terminal state $i_T$, and the optimal path $\hat{s}^T(i_T)$.

$$i_t(j) := \arg\max_{i \in S} \bigl[\delta_t(i) + (C_2 + C_4) \log p_{ij}\bigr], \quad \text{when } t = 1, \ldots, T - 1;$$
$$i_T := \arg\max_{i \in S} \delta_T(i). \tag{25}$$

$$\hat{s}^t(j) := \begin{cases} j, & \text{when } t = 1; \\ \bigl(\hat{s}^{t-1}(i_{t-1}(j)),\, j\bigr), & \text{when } t = 2, \ldots, T. \end{cases} \tag{26}$$

The following theorem formalizes the dynamic programming argument; its proof is standard 
and we state it below for completeness only. 

Theorem 3 Any solution to (19) can be represented in the form $\hat{s}^T(i_T)$ provided the ties
in (25) are broken accordingly.

Proof With a slight abuse of notation, for every $s^t \in S^t$, let

$$U(s^t) = \sum_{u=1}^{t} \bigl[ g_u(s_u) + (C_2 + C_4) \log p_{s_{u-1} s_u} \bigr],$$

where $s_0 := 0$ and $p_{0s} := \pi_s$. Hence,

$$-T \bigl[ C_1 \bar{R}_1(s^T \mid x^T) + C_2 \bar{R}_\infty(s^T, x^T) + C_3 \bar{R}_1(s^T) + C_4 \bar{R}_\infty(s^T) \bigr] = U(s^T)$$

and any maximizer of $U(s^T)$ is clearly a solution to (19) and (20).

Next, note that $\delta_1(j) = U(j)$ for all $j \in S$, and that

$$U(s^{t+1}) = U(s^t) + (C_2 + C_4) \log p_{s_t s_{t+1}} + g_{t+1}(s_{t+1}),$$

for $t = 1, 2, \ldots, T - 1$ and all $s^{t+1} \in S^{t+1}$. By induction on $t$, these yield

$$\delta_t(j) = \max_{s^t : s_t = j} U(s^t)$$

for every $t = 1, 2, \ldots, T$ and for all $j \in S$. Clearly, every maximizer $\hat{s}^T$ of $U(s^T)$ over
the set $S^T$ must satisfy $\hat{s}_T = i_T$, or, more precisely, $\hat{s}_T \in \arg\max_{j \in S} \delta_T(j)$, allowing for
non-uniqueness. Continuing to interpret argmax as a set, recursion (24) implies recursions
(25) and (26), hence any maximizer $\hat{s}^T$ can indeed be computed in the form $\hat{s}^T(\hat{s}_T)$ via the
forward (recursion (25))-backward (recursion (26)) procedure. ■
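
The dynamic programming scheme of (24)-(26) can be sketched in Python as follows. This is a minimal, unscaled illustration under stated assumptions rather than the authors' implementation: all function names (`xlog`, `smoothing`, `prior_marginals`, `node_scores`, `decode`, `brute_force`) and the toy model are hypothetical, a zero weight is taken to switch its term off entirely (the convention $0 \cdot \log 0 = 0$), and the brute-force routine is included only to check the recursion on small examples.

```python
import math
from itertools import product


def xlog(c, p):
    """c * log(p), with 0 * log(p) = 0 and log(0) = -inf (assumed conventions)."""
    if c == 0:
        return 0.0
    return c * math.log(p) if p > 0 else -math.inf


def smoothing(pi, P, B):
    """Posterior marginals p_t(j | x^T) via unscaled forward-backward."""
    T, K = len(B), len(pi)
    a = [[pi[j] * B[0][j] for j in range(K)]]
    for t in range(1, T):
        a.append([B[t][j] * sum(a[-1][i] * P[i][j] for i in range(K)) for j in range(K)])
    b = [[1.0] * K]
    for t in range(T - 1, 0, -1):
        b.insert(0, [sum(P[i][j] * B[t][j] * b[0][j] for j in range(K)) for i in range(K)])
    px = sum(a[-1])
    return [[a[t][j] * b[t][j] / px for j in range(K)] for t in range(T)]


def prior_marginals(pi, P, T):
    """Prior marginals p_t(j) = P(Y_t = j) of a homogeneous chain."""
    K, m = len(pi), [list(pi)]
    for _ in range(T - 1):
        m.append([sum(m[-1][i] * P[i][j] for i in range(K)) for j in range(K)])
    return m


def node_scores(pi, P, B, C):
    """g_t(j) = C1 log p_t(j|x^T) + C2 log f_j(x_t) + C3 log p_t(j)."""
    post, prior = smoothing(pi, P, B), prior_marginals(pi, P, len(B))
    return [[xlog(C[0], post[t][j]) + xlog(C[1], B[t][j]) + xlog(C[2], prior[t][j])
             for j in range(len(pi))] for t in range(len(B))]


def decode(pi, P, B, C):
    """Scores (24), backpointers (25), backtracking (26); returns (path, U(path))."""
    T, K, w = len(B), len(pi), C[1] + C[3]
    g = node_scores(pi, P, B, C)
    delta = [g[0][j] + xlog(w, pi[j]) for j in range(K)]
    back = []
    for t in range(1, T):
        ptr = [max(range(K), key=lambda i: delta[i] + xlog(w, P[i][j])) for j in range(K)]
        delta = [delta[ptr[j]] + xlog(w, P[ptr[j]][j]) + g[t][j] for j in range(K)]
        back.append(ptr)
    j = max(range(K), key=lambda s: delta[s])
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    return tuple(reversed(path)), max(delta)


def brute_force(pi, P, B, C):
    """Maximize U(s^T) by enumerating all K^T paths (for checking only)."""
    T, K, w = len(B), len(pi), C[1] + C[3]
    g = node_scores(pi, P, B, C)

    def U(s):
        return g[0][s[0]] + xlog(w, pi[s[0]]) + sum(
            g[t][s[t]] + xlog(w, P[s[t - 1]][s[t]]) for t in range(1, T))

    best = max(product(range(K), repeat=T), key=U)
    return best, U(best)


# Toy model (made-up numbers); C = (C1, C2, C3, C4).
pi = [0.6, 0.4]
P = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.1], [0.4, 0.3], [0.1, 0.6]]
print(decode(pi, P, B, (0, 1, 0, 0))[0])  # C1 = C3 = C4 = 0: MAP/Viterbi case
print(decode(pi, P, B, (1, 0, 0, 0))[0])  # C2 = C3 = C4 = 0: PMAP case
print(decode(pi, P, B, (1, 0, 0, 2))[0])  # posterior marginals plus weighted prior
```

The cost of `decode` beyond the forward-backward pass is of order $T K^2$, in line with the complexity remark made for Theorem 3, whereas `brute_force` is exponential in $T$.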



Similarly to the generalized risk minimization of (19), the generalized problem of accuracy
optimization (18) can also be further generalized as follows:

$$\min_{s^T} \bigl[ C_1 R_1(s^T \mid x^T) + C_2 \bar{R}_\infty(s^T \mid x^T) + C_3 R_1(s^T) + C_4 \bar{R}_\infty(s^T) \bigr], \tag{27}$$

where the risk

$$R_1(s^T) := \frac{1}{T} \sum_{t=1}^{T} P(Y_t \ne s_t) = 1 - \frac{1}{T} \sum_{t=1}^{T} p_t(s_t) \tag{28}$$

is the error rate relative to the prior distribution. This problem apparently can be solved
by the following recursion

$$\delta_1(j) := C_1 p_1(j \mid x^T) + (C_2 + C_4) \log \pi_j + C_2 \log f_j(x_1) + C_3 \pi_j, \quad \forall j \in S,$$
$$\delta_{t+1}(j) := \max_i \bigl(\delta_t(i) + (C_2 + C_4) \log p_{ij}\bigr) + g_{t+1}(j), \tag{29}$$

where now

$$g_t(s) = C_1 p_t(s \mid x^T) + C_2 \log f_s(x_t) + C_3 p_t(s).$$

As in the generalized posterior-Viterbi decoding (19), here $C_2 > 0$ also implies admissibility
of the optimal paths. However, unlike in (19), $C_1, C_4 > 0$ is not sufficient to guarantee
admissibility of the solutions.

4. Other approaches to hybridization of PMAP and Viterbi 

We have been discussing a set of related ideas which allow us to balance path accuracy 
against path probabilities. Next, we extend this discussion by presenting a couple of notably 
different approaches.

4.1 The k-block Posterior-Viterbi decoding

Recall (Subsection 1.2) that Rabiner's compromise between MAP and PMAP is to maximize
the expected number of correctly decoded pairs or triples of (adjacent) states. With $k$ being
the length of the overlapping block ($k = 2, 3, \ldots$) this means to minimize the conditional
risk

$$R_k(s^T \mid x^T) := \sum_{t=1}^{T-k+1} \bigl( 1 - p(s_t^{t+k-1} \mid x^T) \bigr), \tag{30}$$

which derives from the following loss function:

$$L_k(y^T, s^T) := \sum_{t=1}^{T-k+1} I_{\{s_t^{t+k-1} \ne y_t^{t+k-1}\}}. \tag{31}$$

Obviously, for $k = 1$ this gives, up to the constant factor $T$, the usual $R_1$ minimization, i.e. the
PMAP decoding, which is known to be at fault for allowing inadmissible paths. It is natural to
think that minimizers
of $R_k(s^T \mid x^T)$ "move" towards Viterbi paths "monotonically" as $k$ increases to $T$. Indeed,
when $k = T$, minimization of $R_k(s^T \mid x^T)$ (30) is equivalent to minimization of $R_\infty(s^T \mid x^T)$
achieved by the Viterbi decoding. However, as Example A shows below, minimizers of (30)
are not guaranteed to be admissible for $k > 1$, which is a drawback of using the loss $L_k$
(31).
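
The $k$-block criterion can be explored on tiny models by exhaustive search. The Python sketch below maximizes the expected number of correctly decoded blocks, which is equivalent to minimizing the expected loss (31). It is an illustration only: the names `joint` and `k_block_decode` and the toy model are assumptions, and the enumeration over all $K^T$ paths is feasible only for very small $K$ and $T$.

```python
from itertools import product


def joint(path, pi, P, B):
    """p(x^T, s^T); B[t][s] is f_s at the (t+1)-th observation."""
    p = pi[path[0]] * B[0][path[0]]
    for t in range(1, len(path)):
        p *= P[path[t - 1]][path[t]] * B[t][path[t]]
    return p


def k_block_decode(pi, P, B, k):
    """Brute-force maximizer of the expected number of correct k-blocks.

    Returns the selected path and its posterior probability p(s^T | x^T).
    """
    T, K = len(B), len(pi)
    paths = list(product(range(K), repeat=T))
    w = [joint(s, pi, P, B) for s in paths]
    px = sum(w)

    def block_post(s, t):
        # Posterior probability that the block of length k starting at
        # 0-based position t coincides with the corresponding block of s.
        return sum(wi for r, wi in zip(paths, w) if r[t:t + k] == s[t:t + k]) / px

    def expected_correct(s):
        return sum(block_post(s, t) for t in range(T - k + 1))

    best = max(paths, key=expected_correct)
    return best, joint(best, pi, P, B) / px


# Made-up model: uninformative observations, transition 0 -> 0 forbidden.
pi = [0.4, 0.35, 0.25]
P = [[0.0, 0.5, 0.5], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
B = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
for k in (1, 2):
    print(k, k_block_decode(pi, P, B, k))
```

In this two-step toy model, $k = 1$ reproduces the inadmissible PMAP path and $k = T = 2$ reproduces the Viterbi path; the sketch does not by itself demonstrate inadmissibility for $1 < k < T$.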

We now show that this drawback is easily overcome when the sum of block probabilities
underlying (30) is replaced by their product. Certainly, these problems are not equivalent, and
in particular with the product in place of the sum the $k$-block idea works well. Namely, the
longer the block, the larger the resulting path probability, which is also now guaranteed to be
positive already for $k = 2$. Moreover, this gives another interpretation of the risks
$\bar{R}_1(s^T \mid x^T) + C \bar{R}_\infty(s^T \mid x^T)$ (see also Remark 2 above) and, though perhaps less
interestingly, the prior risks $\bar{R}_1(s^T) + C \bar{R}_\infty(s^T)$.

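Let $k$ be a positive integer. For the time being, let $p$ represent any first order Markov
chain on $S^T$, and let us define

$$\bar{U}_k(s^T) := \prod_{j=1-k}^{T-1} p\bigl(s_{\max(j+1,1)}^{\min(j+k,T)}\bigr), \qquad \bar{R}_k(s^T) := -\frac{1}{T} \log \bar{U}_k(s^T).$$

Thus

$$\bar{U}_k(s^T) = U_1^k \cdot U_2^k \cdot U_3^k,$$

where

$$U_1^k := p(s_1) \cdots p(s_1^{k-2})\, p(s_1^{k-1}),$$
$$U_2^k := p(s_1^k)\, p(s_2^{k+1}) \cdots p(s_{T-k}^{T-1})\, p(s_{T-k+1}^T),$$
$$U_3^k := p(s_{T-k+2}^T)\, p(s_{T-k+3}^T) \cdots p(s_T).$$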
Thus, $\bar{R}_k$ is a natural generalization of $\bar{R}_1$ (introduced first for the posterior 
distribution in (6)) since when $k = 1$, $\bar{R}_k = \bar{R}_1$. 

Theorem 4 Let $k$ be such that $T \ge k > 1$. Then the following recursion holds 

\[ \bar{R}_k(s^T) = \bar{R}_\infty(s^T) + \bar{R}_{k-1}(s^T), \quad \forall s^T \in S^T. \]

Proof Note that 

\[ U_1^k = U_1^{k-1}\, p(s_1^{k-1}), \qquad U_3^k = p(s_{T-k+2}^T)\, U_3^{k-1}. \]

Next, for all $j$ such that $j + k \le T$, the Markov property gives 

\[ p(s_{j+1}^{j+k}) = p(s_{j+k}|s_{j+k-1})\, p(s_{j+1}^{j+k-1}) \]

and 

\begin{align*}
U_2^k\, p(s_{T-k+2}^T) &= p(s_1^k)\, p(s_2^{k+1}) \cdots p(s_{T-k+1}^T)\, p(s_{T-k+2}^T) \\
&= p(s_k|s_{k-1})\, p(s_1^{k-1})\, p(s_{k+1}|s_k)\, p(s_2^k) \cdots p(s_T|s_{T-1})\, p(s_{T-k+1}^{T-1})\, p(s_{T-k+2}^T) \\
&= p(s_k|s_{k-1})\, p(s_{k+1}|s_k) \cdots p(s_T|s_{T-1})\, p(s_1^{k-1}) \cdots p(s_{T-k+1}^{T-1})\, p(s_{T-k+2}^T) \\
&= p(s_k|s_{k-1}) \cdots p(s_T|s_{T-1})\, U_2^{k-1}.
\end{align*}

Hence, 

\begin{align*}
\bar{U}_k(s^T) &= U_1^{k-1}\, p(s_1^{k-1})\, p(s_k|s_{k-1}) \cdots p(s_T|s_{T-1})\, U_2^{k-1}\, U_3^{k-1} \\
&= p(s^T)\, U_1^{k-1} U_2^{k-1} U_3^{k-1} = p(s^T)\, \bar{U}_{k-1}(s^T).
\end{align*}

The second equality above also follows from the Markov property. Taking logarithms on 
both sides and dividing by $-T$ completes the proof. ■ 
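
As an informal numerical check of the recursion in Theorem 4 (an illustration added here, not part of the proof), the following Python sketch evaluates the product-based risk directly from its definition for a randomly generated first order chain and compares both sides of the recursion over all paths. The helper names (`block_prob`, `log_risk`, `random_chain`) and the chain itself are hypothetical; block probabilities are computed under the assumption that $p(s_a^b)$ is the time-$a$ marginal times the subsequent transition probabilities.

```python
import itertools
import math
import random

def block_prob(path, a, b, marginals, P):
    """p(s_a^b) for a first-order chain; a and b are 0-based, inclusive."""
    prob = marginals[a][path[a]]
    for t in range(a + 1, b + 1):
        prob *= P[path[t - 1]][path[t]]
    return prob

def log_risk(path, k, marginals, P):
    """-(1/T) log of the product of block probabilities over j = 1-k, ..., T-1."""
    T = len(path)
    total = 0.0
    for j in range(1 - k, T):
        a = max(j + 1, 1) - 1   # 1-based block start, converted to 0-based
        b = min(j + k, T) - 1   # 1-based block end, converted to 0-based
        total += math.log(block_prob(path, a, b, marginals, P))
    return -total / T

def log_risk_inf(path, marginals, P):
    """-(1/T) log p(s^T)."""
    T = len(path)
    return -math.log(block_prob(path, 0, T - 1, marginals, P)) / T

def random_chain(K, T, rng):
    """Strictly positive initial law and transition matrix, plus time marginals."""
    def simplex(n):
        w = [rng.random() + 0.1 for _ in range(n)]
        total = sum(w)
        return [x / total for x in w]
    pi, P = simplex(K), [simplex(K) for _ in range(K)]
    marginals = [pi]
    for _ in range(T - 1):
        prev = marginals[-1]
        marginals.append([sum(prev[i] * P[i][j] for i in range(K)) for j in range(K)])
    return marginals, P

rng = random.Random(0)
K, T = 3, 5
marginals, P = random_chain(K, T, rng)
max_gap = 0.0
for path in itertools.product(range(K), repeat=T):
    for k in range(2, T + 1):
        lhs = log_risk(path, k, marginals, P)
        rhs = log_risk_inf(path, marginals, P) + log_risk(path, k - 1, marginals, P)
        max_gap = max(max_gap, abs(lhs - rhs))
```

Here `max_gap` records the largest discrepancy between the two sides over all paths and all block lengths; under the stated assumptions it should vanish up to floating-point rounding.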

Now, we specialize this result to our HMM context, and, thus, $p(s^T)$ and $p(s^T|x^T)$ are again 
the prior and posterior hidden path distributions. 

Corollary 5 Let $k$ be such that $T \ge k > 1$. For all paths $s^T \in S^T$ the prior risks 
$\bar{R}_k$ and $\bar{R}_\infty$ satisfy (32). For every $x^T \in X^T$ and for all paths 
$s^T \in S^T$, the posterior risks $\bar{R}_k$ and $\bar{R}_\infty$ satisfy (33). 

\[ \bar{R}_k(s^T) = \bar{R}_\infty(s^T) + \bar{R}_{k-1}(s^T), \tag{32} \]

\[ \bar{R}_k(s^T|x^T) = \bar{R}_\infty(s^T|x^T) + \bar{R}_{k-1}(s^T|x^T). \tag{33} \]

Proof Clearly, conditioned on the data $x^T$, $Y^T$ remains a first order Markov chain (generally 
inhomogeneous even if it was homogeneous a priori). Hence, Theorem 4 applies. ■ 

Below, we focus on the posterior distribution and risks, even though the following would 
readily extend to any first order Markov chain. 

Let $v(x^T;k)$ be a classifier that minimizes $\bar{R}_k(s^T|x^T)$. Thus, 

\[ v(x^T;k) = \arg\max_{s^T} \bar{U}_k(s^T|x^T) = \arg\min_{s^T} \bar{R}_k(s^T|x^T). \]

We refer to such classifiers as k-block PVD or k-block PMAP as they naturally extend 
$v(\cdot;1)$, the PMAP/optimal accuracy/posterior decoder (Section 2.1). To be consistent with 
applications (Fariselli et al., 2005), the term 'k-block posterior-Viterbi decoding', however, 
is perhaps more accurate given the use of the product-based risk $\bar{R}_1(s^T|x^T)$ as opposed to 
$R_1(s^T|x^T)$. 

Now, we present some properties of the new risks and decoders. 

Corollary 6 For every $x^T \in X^T$, and for every $s^T \in S^T$, we have 

\[ \bar{R}_k(s^T|x^T) = (k-1)\bar{R}_\infty(s^T|x^T) + \bar{R}_1(s^T|x^T) \quad \text{for every } k = 1, 2, \ldots, T. \tag{34} \]

$v(x^T;k)$ is admissible for every $k = 2, \ldots, T$. 

\[ \bar{R}_\infty(v(x^T;k)|x^T) \le \bar{R}_\infty(v(x^T;k-1)|x^T) \quad \text{for every } k = 2, \ldots, T. \tag{35} \]

\[ \bar{R}_1(v(x^T;k)|x^T) \ge \bar{R}_1(v(x^T;k-1)|x^T) \quad \text{for every } k = 2, \ldots, T. \tag{36} \]

Proof Equation (34) follows immediately from equation (33) of Corollary 5. Admissibility 
follows from (34): for $k \ge 2$ the term $(k-1)\bar{R}_\infty(s^T|x^T)$ is infinite on paths of zero 
posterior probability, whereas the risk is finite on some admissible path. Inequality 
(35) follows from inequalities (37) below 

\begin{align*}
\bar{R}_\infty(v(x^T;k-1)|x^T) - \bar{R}_\infty(v(x^T;k)|x^T) &\ge \tag{37} \\
\bar{R}_{k-1}(v(x^T;k)|x^T) - \bar{R}_{k-1}(v(x^T;k-1)|x^T) &\ge 0,
\end{align*}

which in turn follow from equation (33) of Corollary 5 and the optimality of $v(x^T;k)$ and 
$v(x^T;k-1)$ for their respective risks. Also, equation (34) implies that for every $k \ge 2$, 

\begin{align*}
(k-2)\bar{R}_\infty(v(x^T;k-1)|x^T) + \bar{R}_1(v(x^T;k-1)|x^T) \le \\
(k-2)\bar{R}_\infty(v(x^T;k)|x^T) + \bar{R}_1(v(x^T;k)|x^T),
\end{align*}

which, together with inequality (35), implies (36). ■ 

Inequality (35) means that the posterior path probability $p(v(x^T;k)|x^T)$ increases with $k$. 
Equation (34) is also of practical significance, showing that $v(x^T;k)$ is a solution to (19) 
with $C_1 = 1$, $C_2 = k - 1$, $C_3 = C_4 = 0$, and as such can be computed in the same fashion 
for all $k$ (see Theorem 3 above). 
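
To make the practical remark concrete, the sketch below shows one way such a decoder might be computed. It is a minimal illustration under stated assumptions, not the algorithm of Theorem 3 itself: by (34), minimizing the k-block risk amounts to maximizing $\sum_t \log p_t(s_t|x^T) + (k-1)\log p(x^T, s^T)$, which a Viterbi-type recursion can handle once the smoothing probabilities are available. The function names, the argument `B` (a $T \times K$ table with `B[t][j]` standing for $f_j(x_{t+1})$) and the convention $0 \cdot \log 0 = 0$ are all choices made for this example. The inputs follow the four-state example of Appendix A with $A = 10$, assuming emission values 1 and 0 at the first and last positions.

```python
import math

NEG_INF = float("-inf")

def wlog(weight, prob):
    """weight * log(prob), with the conventions 0 * log(0) = 0 and log(0) = -inf."""
    if weight == 0:
        return 0.0
    return weight * math.log(prob) if prob > 0 else NEG_INF

def normalize(v):
    z = sum(v)
    return [x / z for x in v]

def posteriors(pi, P, B):
    """Scaled forward-backward pass returning p_t(j | x^T) as a T x K list."""
    T, K = len(B), len(pi)
    alpha = [normalize([pi[j] * B[0][j] for j in range(K)])]
    for t in range(1, T):
        a = alpha[-1]
        alpha.append(normalize(
            [sum(a[i] * P[i][j] for i in range(K)) * B[t][j] for j in range(K)]))
    beta = [[1.0] * K for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = normalize(
            [sum(P[j][i] * B[t + 1][i] * beta[t + 1][i] for i in range(K))
             for j in range(K)])
    return [normalize([alpha[t][j] * beta[t][j] for j in range(K)]) for t in range(T)]

def k_block_decode(pi, P, B, k):
    """Maximize sum_t log p_t(s_t | x^T) + (k - 1) log p(x^T, s^T) over paths."""
    T, K = len(B), len(pi)
    post, w = posteriors(pi, P, B), k - 1
    delta = [wlog(1, post[0][j]) + wlog(w, pi[j] * B[0][j]) for j in range(K)]
    back = []
    for t in range(1, T):
        ptr = [max(range(K), key=lambda i: delta[i] + wlog(w, P[i][j]))
               for j in range(K)]
        delta = [delta[ptr[j]] + wlog(w, P[ptr[j]][j]) + wlog(w, B[t][j])
                 + wlog(1, post[t][j]) for j in range(K)]
        back.append(ptr)
    path = [max(range(K), key=lambda j: delta[j])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

# Four-state example of Appendix A with A = 10 (states 1..4 are indices 0..3).
A = 10.0
pi = [0.0, 1.0, 0.0, 0.0]
P = [[0.0, 0.5, 0.25, 0.25], [0.5, 0.125, 0.125, 0.25],
     [0.25, 0.125, 0.125, 0.5], [0.25, 0.25, 0.5, 0.0]]
B = [[0.0, 1.0, 0.0, 0.0], [A, 1.0, 1.0, 1.0], [A, 1.0, 1.0, 1.0], [0.0, 1.0, 0.0, 0.0]]
pmap_path = k_block_decode(pi, P, B, 1)       # k = 1: pointwise (PMAP) decoding
two_block_path = k_block_decode(pi, P, B, 2)  # k = 2: product-based 2-block decoding
```

With these inputs, `pmap_path` corresponds to the inadmissible path (2, 1, 1, 2), while `two_block_path` is one of the two admissible paths (2, 1, 4, 2) and (2, 4, 1, 2) that Appendix A identifies as optimal for the 2-block product risk; which of the two is returned depends on tie-breaking.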

Thus, increasing $k$ increases the $\bar{R}_1$-risk, i.e. decreases the product of the (conditional) 
marginal probabilities of states along the path $v(x^T;k)$. Inequalities (35) and (36) clearly 
show that as $k$ increases, $v(\cdot;k)$ monotonically moves from $v(\cdot;1)$ (PMAP) towards the 
Viterbi decoder, i.e. $v(\cdot;\infty)$. However, the maximum block length is $k = T$. A natural way 
to complete this bridging of PMAP with MAP is by embedding the collection of risks $\bar{R}_k$ 
into the family $\bar{R}_\alpha$ via $\alpha = 1/k \in [0, 1]$. Thus, (34) extends to 

\[ \bar{R}_\alpha(s^T|x^T) := (1 - \alpha)\bar{R}_\infty(s^T|x^T) + \alpha\bar{R}_1(s^T|x^T) \tag{38} \]

with $\alpha = 0$ and $\alpha = 1$ corresponding to the Viterbi and PMAP cases, respectively. Given 
$x^T$ and a sufficiently small $\alpha$ (equivalently, large $k$), $v(x^T;k)$, the minimizer of 
$\bar{R}_\alpha(s^T|x^T)$ (38) (or, the right hand side of (34)), would produce a Viterbi path 
$v(x^T;\infty)$ (since $S^T$ is finite). However, such $\alpha$ (and $k$) would generally depend on 
$x^T$, and in particular $k$ may need to be larger than $T$, i.e. $v(x^T;T)$ may be different from 
$v(x^T;\infty)$. At the same time, we clearly have 

\[ \bar{R}_\infty(v(x^T;\infty)|x^T) \le \bar{R}_\infty(v(x^T;k)|x^T) \le \bar{R}_\infty(v(x^T;\infty)|x^T) + \frac{\bar{R}_1(v(x^T;\infty)|x^T)}{k-1}, \tag{39} \]

on which we comment more in Section 5 below. 

4.2 Algorithmic approaches 

An alternative that does not involve the risk functions is to simply transform the forward 
and backward variables $\alpha_t(i)$ and $\beta_t(j)$ defined in (1). Consider, for one example, the 
recursively applied power transformations given in (40) and (41) below 



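\begin{align*}
\alpha_1(j;q) &:= \alpha_1(j), \\
\alpha_{t+1}(j;q) &:= \Bigl[\sum_{i \in S} \bigl(\alpha_t(i;q)\, p_{ij}\bigr)^q\Bigr]^{1/q} f_j(x_{t+1}), \quad 1 \le t < T, \tag{40} \\
\beta_t(j;q) &:= \Bigl[\sum_{i \in S} \bigl(p_{ji}\, f_i(x_{t+1})\, \beta_{t+1}(i;q)\bigr)^q\Bigr]^{1/q}, \quad 1 \le t < T, \tag{41} \\
\beta_T(j;q) &:= \beta_T(j) = 1.
\end{align*}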



Clearly, $\alpha_t(j;1) = \alpha_t(j)$ and $\beta_t(j;1) = \beta_t(j)$, for all $j \in S$ and all 
$t = 1, 2, \ldots, T$. Thus, $q = 1$ leads to the PMAP decoding. Using induction on $t$ and 
continuity of the power transform, it can also be seen that the following limits exist and 
are finite for all $j \in S$ and all $t = 1, 2, \ldots, T$: 
$\lim_{q \to \infty} \alpha_t(j;q) =: \alpha_t(j;\infty)$ and 
$\lim_{q \to \infty} \beta_t(j;q) =: \beta_t(j;\infty)$, where for $1 \le t \le T$, 

\[ \alpha_t(j;\infty) = \max_{s^t:\, s_t = j} p(x^t, s^t), \qquad \beta_t(j;\infty) = \max_{s_{t+1}^T} p(x_{t+1}^T, s_{t+1}^T \mid Y_t = j) \]

(with $\beta_T(j;\infty) = 1$), and therefore, any Viterbi path 
$v(x^T;\infty) = (v_1, \ldots, v_T)$ has the following property: 

\[ v_t = \arg\max_{j \in S} \{\alpha_t(j;\infty)\beta_t(j;\infty)\}. \tag{42} \]



This has already been pointed out by (Brushe et al., 1998), who, to the best of our 
knowledge, have been the only group to publish on the idea of hybridization of the PMAP 
and Viterbi decoders via a continuous transformation. Ignoring potential non-uniqueness 
of Viterbi paths, (Brushe et al., 1998) state, based on (42), that the Viterbi path can be 
found symbol-by-symbol. Certainly, when Viterbi paths are non-unique, symbol-by-symbol 
decoding based on (42) can produce suboptimal, and in principle inadmissible, paths. In 
contrast to Viterbi, non-unique PMAP paths (in the absence of constraints) can certainly 
be found symbol-by-symbol. 
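
The power-transformed recursions and the symbol-by-symbol rule discussed above can be sketched in a few lines of Python. This is only an illustrative, unscaled implementation: it assumes the standard conventions $\alpha_1(j) = \pi_j f_j(x_1)$ and $\beta_T(j) = 1$ for the variables in (1), represents the observations through a hypothetical table `B` with `B[t][j]` standing for $f_j(x_{t+1})$, and treats $q = \infty$ as the maximum. The small two-state model is made up for the example.

```python
import math

INF = float("inf")

def _combine(terms, q):
    """(sum_i terms_i^q)^(1/q); the q = inf limit is the maximum."""
    if q == INF:
        return max(terms)
    return sum(x ** q for x in terms) ** (1.0 / q)

def power_forward(pi, P, B, q):
    """Power-transformed forward variables, started from pi_j * f_j(x_1)."""
    K = len(pi)
    alpha = [[pi[j] * B[0][j] for j in range(K)]]
    for t in range(1, len(B)):
        prev = alpha[-1]
        alpha.append([_combine([prev[i] * P[i][j] for i in range(K)], q) * B[t][j]
                      for j in range(K)])
    return alpha

def power_backward(P, B, q):
    """Power-transformed backward variables, started from beta_T(j) = 1."""
    K, T = len(P), len(B)
    beta = [[1.0] * K for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [_combine([P[j][i] * B[t + 1][i] * beta[t + 1][i]
                             for i in range(K)], q) for j in range(K)]
    return beta

def symbolwise_decode(pi, P, B, q):
    """argmax_j alpha_t(j; q) * beta_t(j; q), taken separately for each t."""
    alpha, beta = power_forward(pi, P, B, q), power_backward(P, B, q)
    K = len(pi)
    return [max(range(K), key=lambda j: alpha[t][j] * beta[t][j])
            for t in range(len(B))]

# Hypothetical two-state model and emission likelihoods for four observations.
pi = [0.6, 0.4]
P = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.2], [0.8, 0.3], [0.1, 0.7], [0.2, 0.9]]

a_inf, b_inf = power_forward(pi, P, B, INF), power_backward(P, B, INF)
maxima = [max(a_inf[t][j] * b_inf[t][j] for j in range(2)) for t in range(len(B))]
a_one, b_one = power_forward(pi, P, B, 1), power_backward(P, B, 1)
likelihoods = [sum(a_one[t][j] * b_one[t][j] for j in range(2)) for t in range(len(B))]
pmap_path = symbolwise_decode(pi, P, B, 1)
limit_path = symbolwise_decode(pi, P, B, INF)
```

In this sketch every entry of `maxima` equals the largest joint probability $\max_{s^T} p(x^T, s^T)$, and every entry of `likelihoods` equals $p(x^T)$, which is one way to sanity-check the two limiting cases. Because nothing is rescaled, the code underflows on long sequences and for large finite `q`, in line with the numerical concerns raised in this section.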

Note also that for their hybrid of PMAP with Viterbi, (Brushe et al., 1998) use the 
following transformations: 



1 + (N 



^Infl^wJ, 



(43) 



where $N = K$ (in our notation) and $d_i(\mu)$ are required to be continuous on $[0, \infty)$ with 
finite limits as $\mu \to \infty$. These $d_i$ are then replaced by appropriate expressions in 
terms of the recursively transformed forward and backward variables. 

The following points with regard to this hybridization idea have also motivated our 
present work: 

1. Transformations (43) appear to be somewhat more sophisticated than the power 
transforms (40). It appears that the only reason explicitly stated in (Brushe et al., 1998) 
for making their choice of transformation is to deliver the correct limits (in that case 
PMAP with $\mu = 0$ and Viterbi with $\mu = \infty$). Besides (43) and (40), there are other 
(single parameter) transformations meeting this condition. 

2. Implicitly, the authors of (Brushe et al., 1998) do recognize the usual problem of 
numerical underflow, but somehow appear to suggest that the rescaling trick 

\[ \log(e^a + e^b) = \max\{a, b\} + \log\bigl(1 + e^{-|a-b|}\bigr) \]

would be sufficient to resolve this problem when computing their transformed 
variables. We could not experimentally confirm this optimism even with simple models. 
In fact, without addressing the scaling issue in full, the expressions (43) and (40) 
are short of defining practically meaningful path decoders. True, it is not difficult 
to renormalize these, or similar, expressions while preserving their limiting 
behavior. However, unlike in the original, i.e. untransformed, forward-backward algorithm, 
renormalization of the transformed forward and backward variables will alter the 
original decoders for intermediate values of the tuning parameter. This can already be 
suspected by examining equations (44) and (45) below 



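\begin{align*}
\tilde{\alpha}_1(j;q) &:= \frac{\alpha_1(j)}{\sum_{i \in S} \alpha_1(i)}, \\
\tilde{\alpha}_{t+1}(j;q) &:= \frac{\bigl[\sum_{i \in S} (\tilde{\alpha}_t(i;q)\, p_{ij})^q\bigr]^{1/q} f_j(x_{t+1})}{\sum_{l \in S} \bigl[\sum_{i \in S} (\tilde{\alpha}_t(i;q)\, p_{il})^q\bigr]^{1/q} f_l(x_{t+1})}, \quad 1 \le t < T, \tag{44} \\
\tilde{\beta}_t(j;q) &:= \frac{\bigl[\sum_{i \in S} (p_{ji}\, f_i(x_{t+1})\, \tilde{\beta}_{t+1}(i;q))^q\bigr]^{1/q}}{\sum_{l \in S} \bigl[\sum_{i \in S} (\tilde{\alpha}_t(i;q)\, p_{il})^q\bigr]^{1/q} f_l(x_{t+1})}, \quad 1 \le t < T, \tag{45} \\
\tilde{\beta}_T(j;q) &:= \beta_T(j) = 1,
\end{align*}

which implement the usual rescaling (Rabiner, 1989). 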

3. Moreover, algorithmically defined estimators are generally hard to analyze rigorously 
(Winkler, 2003, pp. 25, 129-131). In our context, optimization criteria for 
intermediate members of the above interpolating families are indeed not clear, making it 
difficult to interpret the corresponding decoders. This might be discouraging should 
such decoders be included in more complex inference cycles (i.e. when any 
genuine model parameters are estimated as well, e.g. Viterbi Training (Koski, 2001; 
Lember and Koloydenko, 2008, 2010)). 

4. Other recursion schemes (for example, cf. (Koski, 2001, pp. 272-273) for Derin's 
formula) can surely be experimented with in a similar manner. However, now more 
than ten years after the appearance of (Brushe et al., 1998), we find that the value of any 
such interpolation is yet to be demonstrated. 

5. As already mentioned above, the symbol-by-symbol implementation of the 
transform-based hybrids is problematic when the solution is non-unique and full path 
probability is a factor. 
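
For reference, the rescaling identity quoted in item 2 can be sketched as follows for the ordinary (untransformed, $q = 1$) forward recursion carried out in the log domain. This illustration makes no claim about whether the same device suffices for the transformed variables, which is precisely the point questioned above; the function names and the toy model are hypothetical.

```python
import math

NEG_INF = float("-inf")

def log_add(a, b):
    """log(e^a + e^b) computed as max(a, b) + log(1 + e^{-|a - b|})."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))

def log_forward(log_pi, log_P, log_B):
    """Ordinary forward recursion in the log domain; returns log alpha_T(j)."""
    K = len(log_pi)
    log_alpha = [log_pi[j] + log_B[0][j] for j in range(K)]
    for t in range(1, len(log_B)):
        new = []
        for j in range(K):
            acc = NEG_INF
            for i in range(K):
                acc = log_add(acc, log_alpha[i] + log_P[i][j])
            new.append(acc + log_B[t][j])
        log_alpha = new
    return log_alpha

# Toy two-state model with 2000 observations; the plain product would underflow.
log_pi = [math.log(0.5), math.log(0.5)]
log_P = [[math.log(0.9), math.log(0.1)], [math.log(0.2), math.log(0.8)]]
log_B = [[math.log(0.1), math.log(0.2)] for _ in range(2000)]
final = log_forward(log_pi, log_P, log_B)
log_lik = log_add(final[0], final[1])
```

The value `log_lik` stays finite even though the likelihood itself is far below the smallest positive double.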

5. Asymptotic risks 

Given a classifier $g$ and a risk function $R$, the quantity $R(g(x^T)|x^T)$ evaluates the risk when 
$g$ is applied to a given sequence $x^T$. When $g$ is optimal in the sense of risk minimization, then 
$R(g(x^T)|x^T) = \min_{s^T} R(s^T|x^T) =: R(x^T)$. We are also interested in the random variables 
$R(g(X^T)|X^T)$. Thus, in (Kuljus and Lember, 2010), convergence of several risks of the 
Viterbi decoding has been considered. Based on the asymptotic theory of Viterbi processes 
$v(X^\infty;\infty)$ (Lember and Koloydenko, 2008, 2010), it has been shown that under fairly 
general assumptions on the HMM, the random variables $R_1(v(X^T;\infty)|X^T)$, 
$\bar{R}_1(v(X^T;\infty)|X^T)$, $\bar{R}_\infty(X^T) = \bar{R}_\infty(v(X^T;\infty)|X^T)$ as well as 
$\bar{R}_\infty(v(X^T;\infty))$, $R_1(v(X^T;\infty))$ and $\bar{R}_1(v(X^T;\infty))$ 
all converge to constant limits, a.s. Convergence of these risks obviously implies convergence 
of 

\[ C_1\bar{R}_1(v(X^T;\infty)|X^T) + C_2\bar{R}_\infty(v(X^T;\infty)|X^T) + C_3\bar{R}_1(v(X^T;\infty)) + C_4\bar{R}_\infty(v(X^T;\infty)), \]

and 

\[ C_1 R_1(v(X^T;\infty)|X^T) + C_2\bar{R}_\infty(v(X^T;\infty)|X^T) + C_3 R_1(v(X^T;\infty)) + C_4\bar{R}_\infty(v(X^T;\infty)), \]

the risks appearing in the generalized problems (19) and (27), respectively. Actually, 
convergence of $\bar{R}_\infty(v(X^T;\infty), X^T)$ is also proved (and used in the proof of 
convergence of $\bar{R}_\infty(v(X^T;\infty)|X^T)$). Hence, the risk in (20), evaluated at Viterbi 
paths, converges as well. 

The limits - asymptotic risks - are (deterministic) constants that depend only on the 
model and evaluate the Viterbi inference. For example, let $R_1(v(\infty))$ be the limit of 
$R_1(v(X^T;\infty)|X^T)$, which is the asymptotic misclassification rate of the Viterbi decoding. 
Thus, for big $T$, the Viterbi decoding makes about $T R_1(v(\infty))$ misclassification errors. The 
asymptotic risks might be, in principle, found theoretically, but as the limit theorems show, 
the limiting risks can also be estimated by simulations. 
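
A simulation of this kind might look as follows. The sketch is purely illustrative: the two-state model with discrete emissions, the sample size and all function names are made up, and the pointwise risk is taken to be $R_1(s^T|x^T) = 1 - \frac{1}{T}\sum_t p_t(s_t|x^T)$, which is an assumption about the form of the ordinary pointwise risk. A single long realization is decoded by both Viterbi and PMAP, and the conditional risks and empirical error rates are recorded.

```python
import math
import random

def normalize(v):
    z = sum(v)
    return [x / z for x in v]

def simulate(pi, P, E, T, rng):
    """Draw hidden states y and discrete observations x from an HMM."""
    K = len(pi)
    y = [rng.choices(range(K), weights=pi)[0]]
    for _ in range(T - 1):
        y.append(rng.choices(range(K), weights=P[y[-1]])[0])
    x = [rng.choices(range(len(E[0])), weights=E[s])[0] for s in y]
    return y, x

def smoothing(pi, P, E, x):
    """Scaled forward-backward pass: p_t(j | x^T) for all t and j."""
    K, T = len(pi), len(x)
    alpha = [normalize([pi[j] * E[j][x[0]] for j in range(K)])]
    for t in range(1, T):
        a = alpha[-1]
        alpha.append(normalize([sum(a[i] * P[i][j] for i in range(K)) * E[j][x[t]]
                                for j in range(K)]))
    beta = [[1.0] * K for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = normalize([sum(P[j][i] * E[i][x[t + 1]] * beta[t + 1][i]
                                 for i in range(K)) for j in range(K)])
    return [normalize([alpha[t][j] * beta[t][j] for j in range(K)]) for t in range(T)]

def viterbi(pi, P, E, x):
    """Log-domain Viterbi; assumes strictly positive pi, P and E."""
    K = len(pi)
    delta = [math.log(pi[j]) + math.log(E[j][x[0]]) for j in range(K)]
    back = []
    for t in range(1, len(x)):
        ptr = [max(range(K), key=lambda i: delta[i] + math.log(P[i][j]))
               for j in range(K)]
        delta = [delta[ptr[j]] + math.log(P[ptr[j]][j]) + math.log(E[j][x[t]])
                 for j in range(K)]
        back.append(ptr)
    path = [max(range(K), key=lambda j: delta[j])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

def r1_risk(path, post):
    """Assumed pointwise risk: 1 - (1/T) sum_t p_t(s_t | x^T)."""
    return 1.0 - sum(post[t][s] for t, s in enumerate(path)) / len(path)

rng = random.Random(1)
pi = [0.5, 0.5]
P = [[0.9, 0.1], [0.2, 0.8]]
E = [[0.7, 0.3], [0.4, 0.6]]
T = 2000
y, x = simulate(pi, P, E, T, rng)
post = smoothing(pi, P, E, x)
pmap = [max(range(len(pi)), key=lambda j: p[j]) for p in post]
vit = viterbi(pi, P, E, x)
risk_pmap, risk_vit = r1_risk(pmap, post), r1_risk(vit, post)
err_pmap = sum(a != b for a, b in zip(pmap, y)) / T
err_vit = sum(a != b for a, b in zip(vit, y)) / T
```

By construction `risk_pmap` cannot exceed `risk_vit` for any realization, whereas the empirical error rates `err_pmap` and `err_vit` are random and only approximate the corresponding risks for large `T`.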

In (Lember, 2009), it has been also shown that under the same assumptions 
$R_1(X^T) = R_1(v(X^T;1)|X^T)$ converges to a constant limit, say $R_1(v(1))$. In (Kuljus and 
Lember, 2010), $\bar{R}_1(X^T) = \bar{R}_1(v(X^T;1)|X^T)$ has been also shown to converge. Clearly 
$R_1(v(\infty)) \ge R_1(v(1))$, and even if their difference is small, the number of errors made 
by the Viterbi decoder in excess of PMAP in the long run can still be significant. 

Presently, we are not aware of a universal method for proving the limit theorems for these 
risks. Convergence of the risks of the Viterbi decoding is possible due to the existence of 
the so-called Viterbi process (see (Koloydenko and Lember, 2008; Lember and Koloydenko, 
2008, 2010)) that has nice ergodic properties. The question whether infinite PMAP 
processes have similar properties is still open. Therefore, convergence of 
$R_1(X^T) = R_1(v(X^T;1)|X^T)$ was proven with a completely different method based on the 
smoothing probabilities. In fact, all of the limit theorems obtained thus far have been proven 
with different methods. We conjecture that these different methods can be combined so 
that convergence of the minimized combined risk (19) or (27) could be proven as well. In 
summary, as mentioned before, thus far convergence of the minimized combined risks has 
been obtained for trivial combinations only, i.e. with three of the four constants being zero. 
Note that while convergence of the intermediate case 

\[ \min_{s^T} \bigl[\bar{R}_1(s^T|X^T) + C\bar{R}_\infty(s^T|X^T)\bigr] \]

with its minimizer $v(x^T;C)$ is an open question, (39) gives 

\[ 0 \le \bar{R}_\infty(v(x^T;C)|x^T) - \bar{R}_\infty(v(x^T;\infty)|x^T) \le \frac{\bar{R}_1(v(x^T;\infty)|x^T)}{C}. \]

This, together with the a.s. convergence of $\bar{R}_1(v(X^T;\infty)|X^T)$, implies that in the long 
run, for most sequences $x^T$, $\bar{R}_\infty(v(x^T;C)|x^T)$ will not exceed 
$\bar{R}_\infty(v(x^T;\infty)|x^T)$ by more than $\lim_{T\to\infty} \bar{R}_1(v(X^T;\infty)|X^T)/C$. 
Since this limit is finite, letting $C$ increase with $T$, $\bar{R}_\infty(v(X^T;C(T))|X^T)$ 
obviously approaches $\lim_{T\to\infty} \bar{R}_\infty(v(X^T;\infty)|X^T)$ a.s., i.e. as the intuition 
predicts, the likelihood of $v(X^T;C(T))$ approaches that of $v(X^T;\infty)$. 
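
For completeness, the two-sided bound used in this argument can be traced back to the optimality of the two decoders. The following short derivation is a sketch in which $v_C$ and $v_\infty$ are shorthand, introduced here only for brevity, for $v(x^T;C)$ and $v(x^T;\infty)$, and in which $\bar{R}_1$ is assumed nonnegative (as a normalized negative log of probabilities).

```latex
\begin{align*}
\bar{R}_1(v_C|x^T) + C\,\bar{R}_\infty(v_C|x^T)
  &\le \bar{R}_1(v_\infty|x^T) + C\,\bar{R}_\infty(v_\infty|x^T)
  && \text{(optimality of } v_C\text{)}, \\
C\,\bigl[\bar{R}_\infty(v_C|x^T) - \bar{R}_\infty(v_\infty|x^T)\bigr]
  &\le \bar{R}_1(v_\infty|x^T) - \bar{R}_1(v_C|x^T)
  \le \bar{R}_1(v_\infty|x^T)
  && (\bar{R}_1 \ge 0).
\end{align*}
```

Dividing by $C > 0$ gives the upper bound, while the lower bound holds because $v_\infty$ minimizes $\bar{R}_\infty(\cdot|x^T)$.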

6. Discussion 

Certainly, the logarithmic risks (3), (6), (14), (21) on the one hand, and the ordinary risks 
(2), (5), $R_\infty(s^T) = 1 - p(s^T)$, (28), on the other, can be respectively combined into a single 
parameter family by, for example, the power transformation as shown below. Let $p$ for the 
moment be any probability distribution on $S^T$. 

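\[ R_1(s^T;\beta) := \begin{cases} -\dfrac{1}{T}\displaystyle\sum_{t=1}^{T} \dfrac{p_t(s_t)^\beta - 1}{\beta}, & \text{if } \beta \neq 0, \\[2ex] -\dfrac{1}{T}\displaystyle\sum_{t=1}^{T} \log p_t(s_t), & \text{if } \beta = 0, \end{cases} \]

\[ R_\infty(s^T;\beta) := \begin{cases} -\dfrac{1}{T}\, \dfrac{p(s^T)^\beta - 1}{\beta}, & \text{if } \beta \neq 0, \\[2ex] -\dfrac{1}{T} \log p(s^T), & \text{if } \beta = 0. \end{cases} \]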



Thus, our two generalized problems (19) and (27) are naturally members of the same family 
of problems: 

\[ \min_{s^T} \Bigl[ C_1 R_1(s^T|x^T;\beta_1) + C_2 R_\infty(s^T|x^T;\beta_2) + C_3 R_1(s^T;\beta_3) + C_4 R_\infty(s^T;\beta_4) \Bigr], \tag{46} \]

where $C_i \ge 0$ and $\beta_i \ge 0$, $i = 1, 2, 3, 4$, and $\sum_{i=1}^{4} C_i > 0$. Clearly, the dynamic 
programming approach of Theorem 3 and (29) immediately applies to any member of the above 
family (46) with $\beta_2 = \beta_4 = 0$. 
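
To illustrate how a single inflection parameter can interpolate between the ordinary and the logarithmic pointwise risks, the following Python sketch evaluates a power-transformed pointwise risk along a path. It assumes, as a working hypothesis for this example only, the Box-Cox form $-\frac{1}{T}\sum_t (p_t(s_t)^\beta - 1)/\beta$ with the logarithm as the $\beta = 0$ case; the function name and the sample values are hypothetical.

```python
import math

def r1_power(marginals_along_path, beta):
    """Power-transformed pointwise risk for the values p_t(s_t), t = 1..T."""
    T = len(marginals_along_path)
    if beta == 0:
        # logarithmic case, the limit of the power form as beta tends to 0
        return -sum(math.log(p) for p in marginals_along_path) / T
    return -sum((p ** beta - 1.0) / beta for p in marginals_along_path) / T

p_vals = [0.9, 0.5, 0.7]                  # hypothetical marginals along a path
ordinary = r1_power(p_vals, 1.0)          # equals 1 - mean(p_vals)
logarithmic = r1_power(p_vals, 0.0)       # equals -mean(log p_vals)
near_zero = r1_power(p_vals, 1e-8)        # close to the logarithmic value
```

Under this assumed form, $\beta = 1$ recovers one minus the average marginal probability along the path, and small positive $\beta$ approaches the logarithmic risk continuously.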

Theorem 4 and Corollaries 5 and 6 obviously generalize to higher order Markov chains 
as can be seen below. 

Proposition 7 Let $p$ represent a Markov chain of order $m$, $1 \le m < T$, on $S^T$. Then for 
any $s^T \in S^T$ and for any $k \in \{m, m+1, \ldots\}$, we have 

\[ \bar{R}_k(s^T) = \bar{R}_m(s^T) + (k - m)\bar{R}_\infty(s^T). \]

Proof This is a straightforward extension of the proof of Theorem 4. ■ 

The present risk-based discussion of HMM path inference also naturally extends to 
the problem of optimal labelling or annotation (Section 1.2). Namely, the state space $S$ 
can be partitioned into subsets $S_1, S_2, \ldots, S_\Lambda$, for some $\Lambda \le K$, in which case 
$\lambda(s)$ assigns label $\lambda$ to every state $s \in S_\lambda$. The fact that the PMAP problem is 
as easily solved over the label space $\Lambda^T$ as it is over $S^T$ has already been used in 
practice. Indeed, adding the admissibility constraint, (Käll et al., 2005) in effect average 
$p_t(s_t|x^T)$'s within the label classes and then use recursions (8) to obtain the PMAP 
labelling, say, $\lambda(x^T;1^+)$, of admissible state paths. This approach clearly corresponds to 
using the point loss $l(s, s') = I_{\{\lambda(s) \neq \lambda(s')\}}$ in (4) when solving 
$\min_{s^T:\, p(s^T|x^T) > 0} R_1(s^T|x^T)$ (7). Importantly, our generalized 
problem (46) also immediately incorporates the above pointwise label-loss in either the prior 
$R_1(\cdot;\beta_3)$ or posterior risk $R_1(\cdot;\beta_1)$, or both. Since computationally these 
problems are essentially as light as (29) and since (Käll et al., 2005) report their special 
case to be useful in practice, we believe that the above generalizations offer yet more useful 
possibilities to practitioners. Note the different kinds of averaging corresponding to 
different values of $\beta$ to be used with the $R_1$ risks: 



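\[ p_t(s;\beta) \propto \begin{cases} \displaystyle\sum_{s' \in S_{\lambda(s)}} p_t(s'), & \text{if } \beta \neq 0, \\[2ex] \sqrt[|S_{\lambda(s)}|]{\displaystyle\prod_{s' \in S_{\lambda(s)}} p_t(s')}, & \text{if } \beta = 0. \end{cases} \]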



Certainly, the choice of the basic loss functions, inflection parameters $\beta_i$ and weights $C_i$ 
of the respective risks, is application dependent, and can be tuned with the help of labelled 
data, using cross-validation. Finally, these generalizations are presented for the standard 
HMM setting, and work on extending them to more complex and practically more useful 
HMM-based settings (e.g. semi-Markov, autoregressive, etc.) is underway. 

Appendix A. Example: Optimal recognition of pairs can still produce 
inadmissible solutions. 

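Consider the following four-state MC transition matrix 

\[ \frac{1}{8} \begin{pmatrix} 0 & 4 & 2 & 2 \\ 4 & 1 & 1 & 2 \\ 2 & 1 & 1 & 4 \\ 2 & 2 & 4 & 0 \end{pmatrix}. \]

Suppose observations $x_1, x_2, x_3, x_4$ and the emission densities $f_s$, $s = 1, 2, 3, 4$, are 
such that 

\[ f_s(x_1) = f_s(x_4) = \begin{cases} 1, & \text{if } s = 2; \\ 0, & \text{if } s \neq 2, \end{cases} \qquad f_s(x_2) = f_s(x_3) = \begin{cases} A > 1, & \text{if } s = 1; \\ 1, & \text{if } s \neq 1. \end{cases} \]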


Hence every admissible path begins and ends with 2. Thus, to simplify the notation, we 
assume without loss of generality that $P(Y_1 = 2) = 1$. Amongst the paths that begin 
and end with 2, the paths whose probabilities are listed below are the only ones of positive 
(prior) probability (the probabilities below are calculated up to the normalization constant): 



\[ \begin{array}{ll}
p(2,1,2,2) = p(2,2,1,2) \propto 16, & p(2,1,2,2|x^4) \propto 16A, \\
p(2,1,3,2) = p(2,3,1,2) \propto 8, & p(2,1,3,2|x^4) \propto 8A, \\
p(2,1,4,2) = p(2,4,1,2) \propto 16, & p(2,1,4,2|x^4) \propto 16A, \\
p(2,2,1,2) = p(2,1,2,2) \propto 16, & p(2,2,1,2|x^4) = p(2,1,2,2|x^4) \propto 16A, \\
p(2,2,2,2) \propto 1, & p(2,2,2,2|x^4) \propto 1, \\
p(2,2,3,2) = p(2,3,2,2) \propto 1, & p(2,2,3,2|x^4) \propto 1, \\
p(2,2,4,2) = p(2,4,2,2) \propto 4, & p(2,2,4,2|x^4) \propto 4, \\
p(2,3,1,2) = p(2,1,3,2) \propto 8, & p(2,3,1,2|x^4) = p(2,1,3,2|x^4) \propto 8A, \\
p(2,3,2,2) = p(2,2,3,2) \propto 1, & p(2,3,2,2|x^4) = p(2,2,3,2|x^4) \propto 1, \\
p(2,3,3,2) \propto 1, & p(2,3,3,2|x^4) \propto 1, \\
p(2,3,4,2) = p(2,4,3,2) \propto 8, & p(2,3,4,2|x^4) \propto 8, \\
p(2,4,1,2) = p(2,1,4,2) \propto 16, & p(2,4,1,2|x^4) = p(2,1,4,2|x^4) \propto 16A, \\
p(2,4,2,2) = p(2,2,4,2) \propto 4, & p(2,4,2,2|x^4) = p(2,2,4,2|x^4) \propto 4, \\
p(2,4,3,2) = p(2,3,4,2) \propto 8, & p(2,4,3,2|x^4) = p(2,3,4,2|x^4) \propto 8.
\end{array} \]



For every pair $s_1, s_2$ of states and for every $i = 1, 2, 3$, let 
$p_{i,i+1}(s_1, s_2) := P(Y_i = s_1, Y_{i+1} = s_2 | x^4)$. Then, we have 

\[ \begin{array}{ll}
p_{1,2}(2,1) = p_{3,4}(1,2) \propto 40A, & p_{2,3}(1,4) = p_{2,3}(4,1) \propto 16A, \\
p_{1,2}(2,2) = p_{3,4}(2,2) \propto 16A + 6, & p_{2,3}(2,2) \propto 1, \\
p_{1,2}(2,3) = p_{3,4}(3,2) \propto 8A + 10, & p_{2,3}(2,3) = p_{2,3}(3,2) \propto 1, \\
p_{1,2}(2,4) = p_{3,4}(4,2) \propto 16A + 12, & p_{2,3}(2,4) = p_{2,3}(4,2) \propto 4, \\
p_{2,3}(1,2) = p_{2,3}(2,1) \propto 16A, & p_{2,3}(3,3) \propto 1, \\
p_{2,3}(1,3) = p_{2,3}(3,1) \propto 8A, & p_{2,3}(3,4) = p_{2,3}(4,3) \propto 8.
\end{array} \]



Let 

\[ W_2(s^4) := p_{1,2}(s_1, s_2) + p_{2,3}(s_2, s_3) + p_{3,4}(s_3, s_4). \]

Then we also have 

\begin{align*}
W_2(2,1,1,2) &\propto 40A + 40A = 80A, \\
W_2(2,1,2,2) &\propto 40A + 16A + 16A + 6 = 72A + 6, \\
W_2(2,1,3,2) &\propto 40A + 8A + 8A + 10 = 56A + 10, \\
W_2(2,1,4,2) &\propto 40A + 16A + 16A + 12 = 72A + 12, \\
W_2(2,2,1,2) &= W_2(2,1,2,2) \propto 72A + 6, \\
W_2(2,2,2,2) &\propto 16A + 6 + 1 + 16A + 6 = 32A + 13, \\
W_2(2,2,3,2) &\propto 16A + 6 + 1 + 8A + 10 = 24A + 17, \\
W_2(2,2,4,2) &\propto 16A + 6 + 4 + 16A + 12 = 32A + 22, \\
W_2(2,3,1,2) &= W_2(2,1,3,2) \propto 56A + 10, \\
W_2(2,3,2,2) &= W_2(2,2,3,2) \propto 24A + 17, \\
W_2(2,3,3,2) &\propto 8A + 10 + 1 + 8A + 10 = 16A + 21, \\
W_2(2,3,4,2) &\propto 8A + 10 + 8 + 16A + 12 = 24A + 30, \\
W_2(2,4,1,2) &= W_2(2,1,4,2) \propto 72A + 12, \\
W_2(2,4,2,2) &= W_2(2,2,4,2) \propto 32A + 22, \\
W_2(2,4,3,2) &= W_2(2,3,4,2) \propto 24A + 30, \\
W_2(2,4,4,2) &\propto 32A + 24.
\end{align*}



Hence, when $A$ is sufficiently big, then 

\[ \arg\max_{s^4} W_2(s^4) = (2, 1, 1, 2), \]

but $p(2, 1, 1, 2) = 0$, i.e. the path is not admissible. 
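
The pair-based computation above can be double-checked by brute force. The Python sketch below enumerates all $4^4$ paths, accumulates unnormalized pair posteriors and maximizes their sum; it assumes emission values 1 and 0 at positions 1 and 4 (so that only state 2 is possible there) and uses a hypothetical function name `appendix_example`.

```python
import itertools

def appendix_example(A):
    """Brute-force maximizer of the summed pair posteriors (states are 1..4)."""
    P = [[0, 4, 2, 2], [4, 1, 1, 2], [2, 1, 1, 4], [2, 2, 4, 0]]  # times 1/8

    def emission(state, t):
        # assumed values: only state 2 is possible at the first and last positions
        if t in (0, 3):
            return 1.0 if state == 2 else 0.0
        return A if state == 1 else 1.0

    weight = {}
    for s in itertools.product((1, 2, 3, 4), repeat=4):
        w = 1.0 if s[0] == 2 else 0.0            # P(Y_1 = 2) = 1
        for t in range(4):
            w *= emission(s[t], t)
            if t > 0:
                w *= P[s[t - 1] - 1][s[t] - 1]
        weight[s] = w                            # proportional to p(s^4 | x^4)

    def pair(t, a, b):
        """Unnormalized posterior of the pair (a, b) at positions t+1, t+2."""
        return sum(w for s, w in weight.items() if s[t] == a and s[t + 1] == b)

    W2 = {s: sum(pair(t, s[t], s[t + 1]) for t in range(3)) for s in weight}
    best = max(W2, key=W2.get)
    return best, W2[best], weight[best]

best_path, best_value, best_weight = appendix_example(10.0)
```

For $A = 10$ the maximizer is (2, 1, 1, 2) with unnormalized value $80A = 800$ and zero path weight. Comparing $80A$ with the runner-up $72A + 12$ suggests that "sufficiently big" here means $A > 3/2$.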

The Viterbi paths here are (2, 1, 2, 2), (2, 2, 1, 2), (2, 1, 4, 2), (2, 4, 1, 2). Also note that, for 
every $i = 1, 2, 3, 4$, $p_2(i) = p_{1,2}(2, i)$ and $p_3(i) = p_{3,4}(i, 2)$. Thus (2, 1, 1, 2) is also 
the unique PMAP path. Minimization of $\bar{R}_2(s^4|x^4)$ over all $s^4$ here is equivalent to 
maximization of $p_2(s_2)\, p_{2,3}(s_2, s_3)\, p_3(s_3)$, and the optimal paths are (2, 1, 4, 2) and 
(2, 4, 1, 2). 